mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-26 06:01:49 +00:00
* feat(skills): add osint-investigation optional skill (closes #355)

Phase-1 public-records OSINT investigation framework adapted from
ShinMegamiBoson/OpenPlanter (MIT). Lives in optional-skills/research/.

Six data-source wiki entries (FEC, SEC EDGAR, USAspending, Senate LD,
OFAC SDN, ICIJ Offshore Leaks), each following the 9-section template:
summary, access, schema, coverage, cross-reference keys, data quality,
acquisition, legal, references.

Six stdlib-only acquisition scripts that emit normalized CSV, plus three
analysis scripts:
- entity_resolution.py: three-tier match (exact / fuzzy / token overlap)
  with explicit confidence per row
- timing_analysis.py: permutation test for donation/contract timing
  correlation, joins through cross-links
- build_findings.py: assembles structured findings.json with evidence
  chains pointing back to source rows

Validation: full pipeline runs end-to-end on synthetic fixtures. Entity
resolution found 24 cross-matches with 0 false positives on a 5-row /
4-row test set. Timing analysis on 5 donations clustered near 3 awards
returned p=0.000, effect size 2.41 SD. Findings JSON correctly tags
HIGH-severity timing pattern. All 9 scripts pass --help and py_compile.

Docs site page auto-generated by website/scripts/generate-skill-docs.py;
sidebar + catalog entries updated by the same generator.

* fix(osint-investigation): live API fixes from end-to-end sweep

Live-tested the skill on a real public-citizen query and found three bugs
the synthetic E2E missed. All three are now fixed and re-verified.

1. FEC fetch hung on contributor name searches. The combination of
   two_year_transaction_period + sort=date + contributor_name puts the
   OpenFEC query plan on a slow path that the upstream gateway times out
   (25s+). Switched to min_date/max_date with no explicit sort. Renamed
   --candidate to --contributor (the original name was misleading: FEC
   searches by donor, not by candidate; --candidate is kept as a
   deprecated alias). Added --state filter for narrowing.

2. ICIJ Offshore Leaks reconcile endpoint returns 404. ICIJ removed the
   Open Refine reconciliation API. Rewrote fetch_icij_offshore.py to
   download the official bulk CSV ZIP (~70 MB, public, no auth) and search
   it locally. Cached under $HERMES_OSINT_CACHE/icij/ (default
   ~/.cache/hermes-osint/icij/) for 30 days, --force-refresh to refetch.
   Verified live: 'PUTIN' query returns 5 Panama Papers officer matches in
   0.5s after first download.

3. SEC EDGAR silently returned 0 when the company-name resolver matched an
   individual Form 3/4/5 filer (insider trading disclosures). Now surfaces
   'Resolved company X → CIK Y (Z)' on stderr, prints a filing-type
   histogram when the type filter wipes results, and explicitly warns when
   the matched CIK appears to be an individual filer rather than a
   corporate registrant.

Bonus: _http.py was retrying 429 responses with exponential backoff plus
honoring (often-missing) Retry-After headers, which compounded into
multi-second hangs per page when the upstream key was over quota. Changed
to fail-fast on 429 with a clear, actionable error showing the upstream's
quota message. Verified: 0.3s fast-fail vs the previous 60s hang on
DEMO_KEY rate-limit exhaustion.

Updated SKILL.md, fec.md, and icij-offshore.md to match the new CLI flags
and ICIJ bulk-cache flow. Regenerated the docusaurus page via
website/scripts/generate-skill-docs.py.

Live sweep results across all 6 sources for 'Dillon Rolnick, New York':
- OFAC SDN: 0 matches ✓ (correctly not sanctioned)
- USAspending: 0 matches ✓ (correctly not a federal contractor)
- Senate LDA: 0 matches ✓ (correctly not a lobbying client)
- SEC EDGAR: warns it resolved to 'Rolnick Michael' (CIK 0001845264) who
  is an individual Form 3 filer, not a corporate registrant
- ICIJ: 0 matches ✓ (correctly not in any offshore leak)
- FEC: rate-limited (DEMO_KEY); fails fast with clear quota message

* feat(osint-investigation): expand to 12 sources covering identity, property, courts, archives, news

Phase-2 expansion per Teknium feedback that the original 6-source skill
(federal financial/regulatory only) wasn't a complete OSINT toolkit. Adds
6 more sources covering the major omissions a real investigation would
reach for first.

New sources (6 fetch scripts + 6 wiki entries):

1. NYC ACRIS: real property records (deeds, mortgages, liens) via the
   city's Socrata API. Search by party name or property address. Joins
   Parties to Master to populate doc_type, dates, borough, and amount.
   Coverage: 5 NYC boroughs, ~70M party records, 1966-present.

2. OpenCorporates: global corporate registry covering 130+ jurisdictions
   (~200M companies). Free API token at
   https://opencorporates.com/api_accounts/new raises the rate limit; HTML
   fallback works without one (limited fields).

3. CourtListener (Free Law Project): federal + state court opinions (~10M
   back to colonial era) + PACER dockets via RECAP. Anonymous v4 search
   works; COURTLISTENER_TOKEN raises rate limits.

4. Wayback Machine CDX: historical web captures (~900B+). Used both for
   surveillance-of-record (when did this site change?) and as a
   content-recovery layer when other sources point to dead URLs.

5. Wikipedia + Wikidata: narrative bio + structured facts. Wikipedia
   OpenSearch for article matching, REST summary for extracts, Wikidata
   Action API (wbgetentities) for claims. Avoids the SPARQL Query Service
   which is aggressively rate-limited.

6. GDELT 2.0 DOC API: global news monitoring in 100+ languages,
   ~2015-present. Auto-retries with 6s backoff on the standard
   1-req-per-5-sec throttle.

Other changes in this commit:
- SEC EDGAR no longer raises SystemExit when the company-name resolver
  finds no CIK; writes an empty CSV with header so the rest of a pipeline
  can keep moving and the warning is just on stderr.
- _http.py User-Agent updated per Wikimedia policy: includes app name,
  version, and a 'set HERMES_OSINT_UA to identify yourself' instruction.
- SKILL.md workflow now groups sources into two clusters (federal
  financial vs identity/property/courts/archives/news) with bash examples
  for each. 'When to use this skill' lists the broader set of
  investigation patterns the expanded sources unlock.

Live sweep results on 'Dillon Rolnick, New York' across all 12 sources:
  ofac            ✓ 0 (correctly clean)
  icij            ✓ 0 (correctly not in any leak)
  usaspending     ✓ 0 (correctly not a federal contractor)
  senate_lda      ✓ 0 (correctly not a lobbying client)
  sec_edgar       ✓ 0, warns: resolved to 'Rolnick Michael' (CIK
                  0001845264), individual Form 3 filer, NOT a corporate
                  registrant
  fec             rate-limited (DEMO_KEY exhausted), fails fast with clear
                  quota message
  nyc_acris       ✓ 200 records named Rolnick across NYC; 48 records at
                  571 Hudson (the property the web identifies as his)
  opencorporates  ✓ 0 (no API token configured; HTML fallback)
  courtlistener   ✓ 0 for 'Dillon Rolnick'; 20 for 'Rolnick' generally;
                  5 for 'Microsoft' sanity check
  wayback         ✓ 30 captures of nousresearch.com from 2011-present
  wikipedia       ✓ 0 (correctly not notable enough); Bill Gates sanity
                  returns full structured facts (occupation, employer,
                  DOB, place of birth, country)
  gdelt           ✓ 0 for 'Dillon Rolnick'; 5 for 'Nous Research'

All 17 scripts compile clean and pass --help. Synthetic analysis pipeline
regression still passes (entity_resolution 30 matches, timing p=0.000,
findings 2).

* feat(osint-investigation): remove FEC; DEMO_KEY rate-limits make it unreliable

The FEC fetcher consistently failed the live sweep because the OpenFEC
DEMO_KEY tier (40 calls/hour) exhausts on a single investigation, and the
upstream returns slow-path query plans for unindexed contributor-name
searches that the gateway times out. Without a real API key it's not
usable; with one the user has to sign up at api.data.gov first. That's too
much setup friction for a skill that should work out of the box.

Removed:
- scripts/fetch_fec.py
- references/sources/fec.md

Updated:
- SKILL.md frontmatter description + tags
- 'When NOT to use' now points users at https://www.fec.gov/data/ for
  federal donations
- entity_resolution example switched from donor↔contractor to
  lobbying-client↔contractor (Senate LDA + USAspending pair)
- timing_analysis example switched to lobbying-filings vs awards
- 8 wiki entries had their 'FEC ↔ ...' cross-reference bullets removed

11 sources remain (5 federal financial + 6 identity/property/courts/
archives/news). All scripts compile, pass --help, and the synthetic
analysis pipeline still passes on the new lobbying-shaped regression
fixture (30 matches, p=0.000 on tight clustering, 2 findings).
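The commit message describes timing_analysis.py as a permutation test, but its source is not part of this view. The block below is only a minimal sketch of one way such a test could work, not the script's actual implementation. It uses a Monte Carlo randomization variant: event dates are redrawn uniformly over an observation window rather than permuted. The function names, the day-offset date representation, and the `window` parameter are all assumptions made for illustration; the result keys mirror the fields that build_findings.py reads from timing.json.

```python
import random
import statistics


def mean_nearest_distance(events: list[int], anchors: list[int]) -> float:
    """Mean absolute distance (in days) from each event to its nearest anchor."""
    return statistics.fmean(min(abs(e - a) for a in anchors) for e in events)


def randomization_test(
    events: list[int],
    anchors: list[int],
    window: tuple[int, int],
    n_draws: int = 2000,
    seed: int = 0,
) -> dict:
    """Hypothetical sketch: are events closer to anchors than chance predicts?

    Dates are integer day offsets. The null redraws every event date
    uniformly inside `window` (inclusive) and recomputes the statistic.
    """
    rng = random.Random(seed)
    lo, hi = window
    observed = mean_nearest_distance(events, anchors)
    null = [
        mean_nearest_distance([rng.randint(lo, hi) for _ in events], anchors)
        for _ in range(n_draws)
    ]
    null_mean = statistics.fmean(null)
    null_sd = statistics.pstdev(null)
    # One-sided: count null draws at least as tightly clustered as observed.
    hits = sum(1 for m in null if m <= observed)
    return {
        "observed_mean_days": observed,
        "null_mean_days": null_mean,
        # +1 smoothing keeps the estimate strictly above zero.
        "p_value": (hits + 1) / (n_draws + 1),
        "effect_size_sd": (null_mean - observed) / null_sd if null_sd else 0.0,
    }


result = randomization_test(
    events=[100, 101, 199, 201, 300],   # five events hugging three anchors
    anchors=[100, 200, 300],
    window=(0, 365),
)
print(result)
```

With smoothing of this kind the p-value can never be exactly zero, so a reported "p=0.000" would be a rounded value under this sketch's assumptions.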
parent d725407c56
commit 5f91b1a48b
32 changed files with 4567 additions and 0 deletions
@ -0,0 +1,82 @@
"""Tiny stdlib HTTP helper used by fetch_*.py scripts.

Provides polite retry + JSON convenience + User-Agent enforcement.
"""
from __future__ import annotations

import json
import os
import time
import urllib.error
import urllib.parse
import urllib.request

DEFAULT_UA = (
    "hermes-osint-investigation/0.2 "
    "(+https://github.com/NousResearch/hermes-agent; "
    "set HERMES_OSINT_UA env var to identify yourself per "
    "Wikimedia / SEC fair-use guidance)"
)


def get(
    url: str,
    *,
    params: dict | None = None,
    headers: dict | None = None,
    user_agent: str | None = None,
    max_retries: int = 3,
    backoff: float = 1.5,
    timeout: float = 30.0,
) -> bytes:
    """GET with retry on 5xx and Retry-After honoring.

    429 (rate-limit) is raised IMMEDIATELY with a clear message; retrying
    when the upstream says "you're over quota" just wastes time. The caller
    should slow down or supply real credentials.
    """
    if params:
        sep = "&" if "?" in url else "?"
        url = f"{url}{sep}{urllib.parse.urlencode(params)}"
    h = {"User-Agent": user_agent or os.environ.get("HERMES_OSINT_UA", DEFAULT_UA)}
    if headers:
        h.update(headers)

    last_err: Exception | None = None
    for attempt in range(max_retries + 1):
        req = urllib.request.Request(url, headers=h)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429:
                # Surface immediately. Read the body so the caller sees the
                # provider's actual message ("OVER_RATE_LIMIT" etc.).
                try:
                    body = e.read(2048).decode("utf-8", errors="replace")
                except Exception:  # noqa: BLE001
                    body = ""
                raise RuntimeError(
                    f"HTTP 429 rate-limited by {urllib.parse.urlsplit(url).netloc}. "
                    f"Slow down or supply a real API key. Body: {body[:300]}"
                ) from e
            if e.code in (500, 502, 503, 504) and attempt < max_retries:
                retry_after = e.headers.get("Retry-After") if e.headers else None
                wait = float(retry_after) if (retry_after and retry_after.isdigit()) else backoff ** (attempt + 1)
                time.sleep(wait)
                last_err = e
                continue
            raise
        except urllib.error.URLError as e:
            if attempt < max_retries:
                time.sleep(backoff ** (attempt + 1))
                last_err = e
                continue
            raise
    if last_err:
        raise last_err
    raise RuntimeError("unreachable")


def get_json(url: str, **kwargs) -> dict | list:
    return json.loads(get(url, **kwargs).decode("utf-8"))
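To make the retry timing of the helper above concrete, the following sketch isolates the wait computation used on 5xx responses. `retry_wait` is a hypothetical name introduced only for illustration; the helper itself computes the wait inline. It assumes the same defaults as `get` (`backoff=1.5`, `max_retries=3`).

```python
from __future__ import annotations


def retry_wait(attempt: int, retry_after: str | None, backoff: float = 1.5) -> float:
    """Hypothetical helper mirroring the inline wait rule for 5xx retries.

    A purely numeric Retry-After value wins; anything else (missing header,
    or the HTTP-date form of Retry-After) falls back to backoff ** (attempt + 1).
    """
    if retry_after and retry_after.isdigit():
        return float(retry_after)
    return backoff ** (attempt + 1)


# Default schedule when the server sends no Retry-After header:
schedule = [retry_wait(attempt, None) for attempt in range(3)]
print(schedule)       # [1.5, 2.25, 3.375]
print(sum(schedule))  # 7.125 seconds of sleep before the final attempt

# A numeric header overrides the exponential schedule:
print(retry_wait(2, "10"))  # 10.0
```

One consequence of the `isdigit()` check is that a Retry-After header given as an HTTP date is ignored rather than parsed, so the exponential schedule applies in that case.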
@ -0,0 +1,67 @@
"""Shared entity-name normalization helpers (stdlib-only).

Used by entity_resolution.py and timing_analysis.py.
"""
from __future__ import annotations

import re

# Legal suffixes / corporate boilerplate to strip during normalization.
_SUFFIX_TOKENS = {
    "INC", "INCORPORATED", "LLC", "LLP", "LP", "LTD", "LIMITED",
    "CORP", "CORPORATION", "CO", "COMPANY",
    "GROUP", "GRP", "HOLDINGS", "HOLDING",
    "PARTNERS", "ASSOCIATES",
    "INTERNATIONAL", "INTL",
    "ENTERPRISES", "ENTERPRISE",
    "SERVICES", "SERVICE", "SVCS",
    "SOLUTIONS", "MANAGEMENT", "MGMT", "CONSULTING",
    "TECHNOLOGY", "TECHNOLOGIES", "TECH",
    "INDUSTRIES", "INDUSTRY",
    "AMERICA", "AMERICAN",
    "USA", "US",
    "PLLC", "PC",
    "TRUST", "FOUNDATION",
}

_PUNCT_RE = re.compile(r"[^\w\s]")
_WS_RE = re.compile(r"\s+")


def normalize_name(name: str | None) -> str:
    """Standard normalization: uppercase, strip suffixes, drop punctuation."""
    if not name:
        return ""
    s = _PUNCT_RE.sub(" ", name.upper())
    s = _WS_RE.sub(" ", s).strip()
    tokens = [t for t in s.split() if t and t not in _SUFFIX_TOKENS]
    return " ".join(tokens)


def normalize_aggressive(name: str | None) -> str:
    """Aggressive normalization: sorted unique tokens (word-bag)."""
    base = normalize_name(name)
    if not base:
        return ""
    return " ".join(sorted(set(base.split())))


def name_tokens(name: str | None, min_len: int = 4) -> set[str]:
    """Token set used for overlap matching."""
    base = normalize_name(name)
    if not base:
        return set()
    return {t for t in base.split() if len(t) >= min_len}


def token_overlap_ratio(left: str | None, right: str | None) -> tuple[float, int]:
    """Return (jaccard-like ratio, shared token count) over min-len tokens."""
    a = name_tokens(left)
    b = name_tokens(right)
    if not a or not b:
        return 0.0, 0
    shared = a & b
    if not shared:
        return 0.0, 0
    union = a | b
    return len(shared) / len(union), len(shared)
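A short worked example may help show what the two normalization levels above produce. The sketch below is a condensed re-implementation so that it runs on its own; `SUFFIXES` is an abbreviated subset of the module's suffix set, chosen only to cover the sample names, and is an assumption of this example.

```python
import re

# Abbreviated subset of the suffix set, enough for the sample names below.
SUFFIXES = {"INC", "LLC", "CORP", "HOLDINGS", "AMERICAN", "TRUST", "COMPANY"}


def normalize_name(name: str) -> str:
    """Uppercase, replace punctuation with spaces, drop suffix tokens."""
    s = re.sub(r"[^\w\s]", " ", (name or "").upper())
    return " ".join(t for t in s.split() if t not in SUFFIXES)


def normalize_aggressive(name: str) -> str:
    """Word-bag form: sorted unique tokens of the normalized name."""
    return " ".join(sorted(set(normalize_name(name).split())))


# Suffixes and punctuation disappear; word order is preserved.
print(normalize_name("Acme Widget Holdings, LLC"))   # ACME WIDGET

# Different word order: the standard forms differ ...
print(normalize_name("Widget Acme Corp."))           # WIDGET ACME
# ... but the word-bag forms agree, which is what the fuzzy tier compares.
print(normalize_aggressive("Widget Acme Corp."))     # ACME WIDGET

# A name built only from suffix tokens normalizes to the empty string.
print(repr(normalize_name("American Trust Company")))  # ''
```

The last case is worth keeping in mind when reading results: a name that consists entirely of stripped tokens normalizes to an empty string, and callers that skip empty normalized names will never match it.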
@ -0,0 +1,221 @@
#!/usr/bin/env python3
"""Build a structured findings.json with evidence chains (stdlib-only).

Aggregates cross_links.csv (entity_resolution output) and an optional
timing.json (timing_analysis output) into a single evidence-chain document.

Output structure:
    {
        "metadata": {...},
        "findings": [
            {
                "id": "F0001",
                "title": "...",
                "severity": "HIGH|MEDIUM|LOW",
                "confidence": "high|medium|low",
                "summary": "...",
                "evidence": [
                    {"source": "cross_links.csv", "row": 12, "fields": {...}},
                    ...
                ],
                "sources": ["cross_links.csv", "timing.json"]
            }
        ]
    }

Every finding traces to specific source rows. No naked claims.
"""
from __future__ import annotations

import argparse
import csv
import json
from collections import defaultdict
from pathlib import Path

CONFIDENCE_ORDER = {"high": 0, "medium": 1, "low": 2}
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def _read_cross_links(path: str) -> list[dict[str, str]]:
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def build_findings(
    cross_links_path: str,
    timing_path: str | None = None,
    out_path: str = "findings.json",
    bundled_threshold: int = 3,
) -> dict:
    findings: list[dict] = []
    next_id = 1

    # 1. Match-based findings, grouped by (left_normalized, right_normalized).
    matches = _read_cross_links(cross_links_path)
    grouped: dict[tuple[str, str], list[dict[str, str]]] = defaultdict(list)
    for i, row in enumerate(matches):
        row["__row__"] = str(i)
        grouped[(row.get("left_normalized", ""), row.get("right_normalized", ""))].append(row)

    for (left_norm, right_norm), rows in grouped.items():
        if not left_norm or not right_norm:
            continue
        # Use the highest-confidence match for the finding's overall confidence.
        best = min(rows, key=lambda r: CONFIDENCE_ORDER.get(r.get("confidence", "low"), 2))
        finding_id = f"F{next_id:04d}"
        next_id += 1
        evidence = [
            {
                "source": "cross_links.csv",
                "row": int(r["__row__"]),
                "fields": {
                    "match_type": r.get("match_type", ""),
                    "confidence": r.get("confidence", ""),
                    "left_name": r.get("left_name", ""),
                    "right_name": r.get("right_name", ""),
                    "overlap_ratio": r.get("overlap_ratio", ""),
                    "shared_tokens": r.get("shared_tokens", ""),
                },
            }
            for r in rows
        ]
        findings.append(
            {
                "id": finding_id,
                "title": f"Entity match: {best.get('left_name', '')} ↔ {best.get('right_name', '')}",
                "severity": "MEDIUM" if best.get("confidence") == "high" else "LOW",
                "confidence": best.get("confidence", "low"),
                "summary": (
                    f"{len(rows)} cross-link record(s) tie "
                    f"'{best.get('left_name', '')}' to "
                    f"'{best.get('right_name', '')}' "
                    f"(best tier: {best.get('match_type', '')})."
                ),
                "evidence": evidence,
                "sources": ["cross_links.csv"],
            }
        )

    # 2. Bundled-donations findings (if cross_links carries donor↔candidate pattern).
    # Heuristic: many distinct left names sharing the same right name.
    by_right: dict[str, set[str]] = defaultdict(set)
    by_right_rows: dict[str, list[dict[str, str]]] = defaultdict(list)
    for r in matches:
        right = r.get("right_normalized", "")
        left_raw = r.get("left_name", "").strip()
        if right and left_raw:
            by_right[right].add(left_raw)
            by_right_rows[right].append(r)
    for right_norm, lefts in by_right.items():
        if len(lefts) < bundled_threshold:
            continue
        rows = by_right_rows[right_norm]
        right_raw = rows[0].get("right_name", "")
        findings.append(
            {
                "id": f"F{next_id:04d}",
                "title": f"Bundled cross-links: {len(lefts)} distinct left entities ↔ '{right_raw}'",
                "severity": "HIGH",
                "confidence": "medium",
                "summary": (
                    f"{len(lefts)} distinct left-side entities link to "
                    f"'{right_raw}'. Pattern suggests coordinated relationship "
                    f"(e.g. bundled donations, multi-vendor employer)."
                ),
                "evidence": [
                    {
                        "source": "cross_links.csv",
                        "row": int(r.get("__row__", "0")),
                        "fields": {
                            "left_name": r.get("left_name", ""),
                            "match_type": r.get("match_type", ""),
                        },
                    }
                    for r in rows
                ],
                "sources": ["cross_links.csv"],
            }
        )
        next_id += 1

    # 3. Timing-based findings.
    if timing_path and Path(timing_path).exists():
        timing = json.loads(Path(timing_path).read_text())
        for r in timing.get("results", []):
            if not r.get("significant"):
                continue
            findings.append(
                {
                    "id": f"F{next_id:04d}",
                    "title": (
                        f"Donation timing significantly clusters near awards: "
                        f"{r['donor']} ↔ {r['recipient']}"
                    ),
                    "severity": "HIGH" if r["p_value"] < 0.01 else "MEDIUM",
                    "confidence": "medium",
                    "summary": (
                        f"Mean nearest-award distance {r['observed_mean_days']} days "
                        f"(null {r['null_mean_days']} days). p={r['p_value']}, "
                        f"effect size {r['effect_size_sd']} SD. "
                        f"{r['n_donations']} donations, {r['n_award_dates']} awards."
                    ),
                    "evidence": [
                        {
                            "source": "timing.json",
                            "row": None,
                            "fields": r,
                        }
                    ],
                    "sources": ["timing.json"],
                }
            )
            next_id += 1

    # Sort: severity → confidence → id.
    findings.sort(
        key=lambda f: (
            SEVERITY_ORDER.get(f["severity"], 3),
            CONFIDENCE_ORDER.get(f["confidence"], 3),
            f["id"],
        )
    )

    payload = {
        "metadata": {
            "n_findings": len(findings),
            "cross_links_path": cross_links_path,
            "timing_path": timing_path,
            "bundled_threshold": bundled_threshold,
        },
        "findings": findings,
    }
    Path(out_path).write_text(json.dumps(payload, indent=2))
    return payload


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--cross-links", required=True)
    p.add_argument("--timing", help="Optional timing.json from timing_analysis.py")
    p.add_argument("--out", default="findings.json")
    p.add_argument(
        "--bundled-threshold",
        type=int,
        default=3,
        help="Minimum distinct left entities to flag as bundled (default 3)",
    )
    a = p.parse_args()

    payload = build_findings(
        cross_links_path=a.cross_links,
        timing_path=a.timing,
        out_path=a.out,
        bundled_threshold=a.bundled_threshold,
    )
    print(f"Wrote {payload['metadata']['n_findings']} findings to {a.out}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
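The bundled-link heuristic in step 2 of the script above can be tried in isolation on in-memory rows. The sketch below is a simplified stand-alone version, not the script's own function: `bundled_targets` and the sample rows are hypothetical, and only the two columns the heuristic reads are included.

```python
from collections import defaultdict


def bundled_targets(rows: list[dict[str, str]], threshold: int = 3) -> dict[str, list[str]]:
    """Return right-side entities linked to >= threshold distinct left names."""
    by_right: dict[str, set[str]] = defaultdict(set)
    for r in rows:
        right = r.get("right_normalized", "")
        left = r.get("left_name", "").strip()
        if right and left:
            by_right[right].add(left)   # distinctness is by raw left name
    return {
        right: sorted(lefts)
        for right, lefts in by_right.items()
        if len(lefts) >= threshold
    }


rows = [
    {"left_name": "Alpha LLC", "right_normalized": "ACME"},
    {"left_name": "Beta Inc", "right_normalized": "ACME"},
    {"left_name": "Gamma Co", "right_normalized": "ACME"},
    {"left_name": "Alpha LLC", "right_normalized": "ZED"},
    {"left_name": "Beta Inc", "right_normalized": "ZED"},
]
print(bundled_targets(rows))
# {'ACME': ['Alpha LLC', 'Beta Inc', 'Gamma Co']}
```

Because distinct left entities are counted by raw name rather than normalized name, spelling variants of a single entity (for example "Alpha LLC" and "ALPHA, LLC") would count separately toward the threshold. Raising `--bundled-threshold` may be a reasonable response when the left-side source is noisy.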
@ -0,0 +1,228 @@
#!/usr/bin/env python3
"""Cross-source entity resolution (stdlib-only).

Given two CSV files with name columns, find candidate matches using three
tiers of normalization:

  1. exact: normalized strings equal
  2. fuzzy: sorted-token (word-bag) match
  3. token_overlap: >=60% Jaccard overlap on >=4-char tokens, >=2 shared

Adapted from ShinMegamiBoson/OpenPlanter (MIT) but generalized: no Boston-
specific record types, no contribution-code filters, no fixed schemas.

Output CSV columns:
    match_type, confidence, left_name, right_name,
    left_normalized, right_normalized, left_row, right_row,
    overlap_ratio, shared_tokens
"""
from __future__ import annotations

import argparse
import csv
import sys
from pathlib import Path

# Allow running directly or as a module.
sys.path.insert(0, str(Path(__file__).parent))
from _normalize import (  # noqa: E402
    normalize_name,
    normalize_aggressive,
    token_overlap_ratio,
)

CONFIDENCE = {
    "exact": "high",
    "fuzzy": "medium",
    "token_overlap": "low",
}


def _read_csv(path: str, name_col: str) -> list[dict[str, str]]:
    rows = []
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if name_col not in (reader.fieldnames or []):
            raise SystemExit(
                f"Column {name_col!r} not in {path}. "
                f"Available: {reader.fieldnames}"
            )
        for i, row in enumerate(reader):
            row["__row__"] = str(i)
            rows.append(row)
    return rows


def _build_index(rows: list[dict[str, str]], name_col: str):
    """Index by exact-normalized and aggressive (sorted-token) form."""
    exact: dict[str, list[dict[str, str]]] = {}
    aggressive: dict[str, list[dict[str, str]]] = {}
    for row in rows:
        raw = row.get(name_col, "")
        n = normalize_name(raw)
        if n:
            exact.setdefault(n, []).append(row)
        a = normalize_aggressive(raw)
        if a:
            aggressive.setdefault(a, []).append(row)
    return exact, aggressive


def _emit(
    out_rows: list[dict[str, str]],
    seen: set[tuple],
    match_type: str,
    left_row: dict[str, str],
    right_row: dict[str, str],
    left_col: str,
    right_col: str,
    ratio: float = 0.0,
    shared: int = 0,
):
    left_raw = left_row.get(left_col, "")
    right_raw = right_row.get(right_col, "")
    key = (
        left_row["__row__"],
        right_row["__row__"],
        match_type,
    )
    if key in seen:
        return
    seen.add(key)
    out_rows.append(
        {
            "match_type": match_type,
            "confidence": CONFIDENCE[match_type],
            "left_name": left_raw,
            "right_name": right_raw,
            "left_normalized": normalize_name(left_raw),
            "right_normalized": normalize_name(right_raw),
            "left_row": left_row["__row__"],
            "right_row": right_row["__row__"],
            "overlap_ratio": f"{ratio:.3f}" if ratio else "",
            "shared_tokens": str(shared) if shared else "",
        }
    )


def resolve(
    left_path: str,
    left_col: str,
    right_path: str,
    right_col: str,
    out_path: str,
    overlap_threshold: float = 0.60,
    min_shared: int = 2,
    skip_overlap: bool = False,
) -> int:
    left_rows = _read_csv(left_path, left_col)
    right_rows = _read_csv(right_path, right_col)

    right_exact, right_aggressive = _build_index(right_rows, right_col)

    out_rows: list[dict[str, str]] = []
    seen: set[tuple] = set()

    # Pass 1+2: exact / fuzzy via index lookup.
    for lrow in left_rows:
        raw = lrow.get(left_col, "")
        n = normalize_name(raw)
        if not n:
            continue
        for rrow in right_exact.get(n, []):
            _emit(out_rows, seen, "exact", lrow, rrow, left_col, right_col)
        a = normalize_aggressive(raw)
        if a:
            for rrow in right_aggressive.get(a, []):
                _emit(out_rows, seen, "fuzzy", lrow, rrow, left_col, right_col)

    if not skip_overlap:
        # Pass 3: token overlap (O(N*M), expensive; allow opt-out).
        for lrow in left_rows:
            l_raw = lrow.get(left_col, "")
            if not normalize_name(l_raw):
                continue
            for rrow in right_rows:
                ratio, shared = token_overlap_ratio(
                    l_raw, rrow.get(right_col, "")
                )
                if ratio >= overlap_threshold and shared >= min_shared:
                    _emit(
                        out_rows,
                        seen,
                        "token_overlap",
                        lrow,
                        rrow,
                        left_col,
                        right_col,
                        ratio=ratio,
                        shared=shared,
                    )

    fieldnames = [
        "match_type",
        "confidence",
        "left_name",
        "right_name",
        "left_normalized",
        "right_normalized",
        "left_row",
        "right_row",
        "overlap_ratio",
        "shared_tokens",
    ]
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(out_rows)
    return len(out_rows)


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--left", required=True, help="Left CSV path")
    p.add_argument(
        "--left-name-col", required=True, help="Name column in left CSV"
    )
    p.add_argument("--right", required=True, help="Right CSV path")
    p.add_argument(
        "--right-name-col",
        required=True,
        help="Name column in right CSV",
    )
    p.add_argument("--out", required=True, help="Output CSV path")
    p.add_argument(
        "--overlap-threshold",
        type=float,
        default=0.60,
        help="Jaccard overlap threshold for token_overlap tier (default 0.60)",
    )
    p.add_argument(
        "--min-shared",
        type=int,
        default=2,
        help="Minimum shared tokens for token_overlap tier (default 2)",
    )
    p.add_argument(
        "--skip-overlap",
        action="store_true",
        help="Skip the O(N*M) token_overlap pass (much faster on large CSVs)",
    )
    args = p.parse_args()

    count = resolve(
        left_path=args.left,
        left_col=args.left_name_col,
        right_path=args.right,
        right_col=args.right_name_col,
        out_path=args.out,
        overlap_threshold=args.overlap_threshold,
        min_shared=args.min_shared,
        skip_overlap=args.skip_overlap,
    )
    print(f"Wrote {count} match rows to {args.out}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
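The token_overlap tier of the script above combines two conditions, a Jaccard threshold and a minimum shared-token count, and it can be useful to see how they interact on concrete names. The sketch below is a simplified stand-alone version: its tokenizer only uppercases and splits on whitespace, skipping the punctuation and suffix stripping that the real helpers perform, and the function names are assumptions of this example.

```python
def tokens(name: str, min_len: int = 4) -> set[str]:
    """Simplified tokenizer: uppercase whitespace tokens of at least min_len chars."""
    return {t for t in name.upper().split() if len(t) >= min_len}


def overlap(left: str, right: str) -> tuple[float, int]:
    """Return (Jaccard ratio, shared token count) for two names."""
    a, b = tokens(left), tokens(right)
    shared = a & b
    if not shared:
        return 0.0, 0
    return len(shared) / len(a | b), len(shared)


def passes(left: str, right: str, threshold: float = 0.60, min_shared: int = 2) -> bool:
    """Both conditions must hold for a token_overlap match."""
    ratio, shared = overlap(left, right)
    return ratio >= threshold and shared >= min_shared


# 2 shared of 3 total tokens: ratio 0.667, passes.
print(passes("Lockheed Martin Aeronautics", "Lockheed Martin"))        # True

# 2 shared of 4 total tokens: ratio 0.5, below the 0.60 threshold.
print(passes("Lockheed Martin Aeronautics", "Lockheed Martin Space"))  # False

# Identical single-token names: ratio 1.0 but only 1 shared token.
print(passes("Acme", "Acme"))                                          # False
```

The last case shows why single-token names can only be matched by the exact or fuzzy tiers under the default `--min-shared 2`.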
@ -0,0 +1,149 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Search court records via CourtListener (Free Law Project).
|
||||
|
||||
Covers ~10M federal and state court opinions, plus PACER docket data
|
||||
where available. Public REST API v4 supports anonymous read access for
|
||||
search; some endpoints require a token (free at courtlistener.com).
|
||||
|
||||
Set COURTLISTENER_TOKEN to authenticate (raises rate limits).
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import os
|
||||
import sys
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
BASE = "https://www.courtlistener.com/api/rest/v4/search/"
|
||||
|
||||
COLUMNS = [
|
||||
"case_name",
|
||||
"court",
|
||||
"court_id",
|
||||
"date_filed",
|
||||
"docket_number",
|
||||
"judge",
|
||||
"citation",
|
||||
"result_type",
|
||||
"snippet",
|
||||
"absolute_url",
|
||||
]
|
||||
|
||||
SEARCH_TYPES = {
|
||||
"opinions": "o", # Court opinions
|
||||
"dockets": "r", # PACER dockets (may require auth depending on coverage)
|
||||
"oral": "oa", # Oral arguments
|
||||
"people": "p", # Judges / people
|
||||
"recap": "r", # Same as dockets in v4
|
||||
}
|
||||
|
||||
|
||||
def fetch(
|
||||
query: str,
|
||||
search_type: str,
|
||||
court: str | None,
|
||||
date_from: str | None,
|
||||
date_to: str | None,
|
||||
token: str | None,
|
||||
limit: int,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
type_code = SEARCH_TYPES.get(search_type, search_type)
|
||||
params = {
|
||||
"q": query,
|
||||
"type": type_code,
|
||||
}
|
||||
if court:
|
||||
params["court"] = court
|
||||
if date_from:
|
||||
params["filed_after"] = date_from
|
||||
if date_to:
|
||||
params["filed_before"] = date_to
|
||||
headers = {"Authorization": f"Token {token}"} if token else None
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
next_url: str | None = f"{BASE}?{urllib.parse.urlencode(params)}"
|
||||
while next_url and len(rows) < limit:
|
||||
try:
|
||||
payload = get_json(next_url, headers=headers)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"CourtListener error: {e}", file=sys.stderr)
|
||||
break
|
||||
if not isinstance(payload, dict):
|
||||
break
|
||||
results = payload.get("results", [])
|
||||
for r in results:
|
||||
if len(rows) >= limit:
|
||||
break
|
||||
rows.append(
|
||||
{
|
||||
"case_name": r.get("caseName", "") or r.get("case_name", "") or "",
|
||||
"court": r.get("court", "") or "",
|
||||
"court_id": r.get("court_id", "") or "",
|
||||
"date_filed": (r.get("dateFiled", "") or r.get("date_filed", "") or "")[:10],
|
||||
"docket_number": r.get("docketNumber", "") or r.get("docket_number", "") or "",
|
||||
"judge": r.get("judge", "") or "",
|
||||
"citation": "; ".join(r.get("citation", []) or []) if isinstance(r.get("citation"), list) else (r.get("citation") or ""),
|
||||
"result_type": search_type,
|
||||
"snippet": (r.get("snippet", "") or "").replace("\n", " ")[:500],
|
||||
"absolute_url": (
|
||||
f"https://www.courtlistener.com{r.get('absolute_url', '')}"
|
||||
if r.get("absolute_url", "").startswith("/")
|
||||
else r.get("absolute_url", "")
|
||||
),
|
||||
}
|
||||
)
|
||||
next_url = payload.get("next")
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
if not rows:
|
||||
print(
|
||||
f"CourtListener: 0 results for type={search_type!r} q={query!r}. "
|
||||
"Most private individuals don't appear in published court records "
|
||||
"unless they were party to a federal or state appellate case.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--query", required=True, help="Search query (party name, case name, keyword)")
|
||||
p.add_argument(
|
||||
"--type",
|
||||
default="opinions",
|
||||
choices=list(SEARCH_TYPES.keys()),
|
||||
help="Search type (default: opinions)",
|
||||
)
|
||||
p.add_argument("--court", help="Court ID filter (e.g. 'nysd' = SDNY, 'scotus' = Supreme Court)")
|
||||
p.add_argument("--date-from", help="Filed-after date YYYY-MM-DD")
|
||||
p.add_argument("--date-to", help="Filed-before date YYYY-MM-DD")
|
||||
p.add_argument("--token", default=os.environ.get("COURTLISTENER_TOKEN"))
|
||||
p.add_argument("--limit", type=int, default=100)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(
|
||||
query=a.query,
|
||||
search_type=a.type,
|
||||
court=a.court,
|
||||
date_from=a.date_from,
|
||||
date_to=a.date_to,
|
||||
token=a.token,
|
||||
limit=a.limit,
|
||||
out_path=a.out,
|
||||
)
|
||||
print(f"Wrote {n} CourtListener rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
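
The CourtListener loop above walks a paginated API by following each response's `next` URL until it has collected `limit` rows. A minimal sketch of that pattern follows, with the HTTP call replaced by an injected `get_page` callable and a hypothetical in-memory `FAKE_PAGES` table; both are assumptions for illustration and not part of the script.

```python
from typing import Callable, Iterator


def paginate(first_url: str, get_page: Callable[[str], dict], limit: int) -> Iterator[dict]:
    """Yield up to `limit` results, following each page's "next" link."""
    url = first_url
    seen = 0
    while url and seen < limit:
        page = get_page(url)
        for item in page.get("results", []):
            if seen >= limit:
                return
            yield item
            seen += 1
        # A missing or null "next" ends the loop.
        url = page.get("next")


# Hypothetical pages standing in for HTTP responses.
FAKE_PAGES = {
    "page1": {"results": [{"id": 1}, {"id": 2}], "next": "page2"},
    "page2": {"results": [{"id": 3}, {"id": 4}], "next": None},
}

first_three = [r["id"] for r in paginate("page1", FAKE_PAGES.__getitem__, limit=3)]
```

Injecting the page fetcher keeps the limit and stop conditions testable without network access; in the script the same role is played by `get_json`.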
|
||||
|
|
@@ -0,0 +1,162 @@
|
|||
#!/usr/bin/env python3
|
"""Search the GDELT 2.0 DOC API for news mentions.

GDELT monitors world news in 100+ languages and indexes the full text.
Free, anonymous, ~15-minute update frequency. Covers ~2015→present.

Useful for surfacing news mentions of a person, company, or topic across
international media; a much wider net than Google News.
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
BASE = "https://api.gdeltproject.org/api/v2/doc/doc"
|
||||
|
||||
COLUMNS = [
|
||||
"title",
|
||||
"url",
|
||||
"seen_date",
|
||||
"domain",
|
||||
"language",
|
||||
"source_country",
|
||||
"tone",
|
||||
"social_image",
|
||||
]
|
||||
|
||||
|
||||
def fetch(
|
||||
query: str,
|
||||
mode: str,
|
||||
timespan: str | None,
|
||||
start_datetime: str | None,
|
||||
end_datetime: str | None,
|
||||
source_country: str | None,
|
||||
source_lang: str | None,
|
||||
limit: int,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
params: dict[str, str] = {
|
||||
"query": query,
|
||||
"mode": mode,
|
||||
"format": "json",
|
||||
"maxrecords": str(min(limit, 250)),
|
||||
"sort": "datedesc",
|
||||
}
|
||||
if timespan:
|
||||
params["timespan"] = timespan
|
||||
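    # GDELT expects YYYYMMDDHHMMSS: keep digits only (this also drops the "T"
    # of an ISO timestamp) and zero-pad a bare date to midnight.
    if start_datetime:
        params["startdatetime"] = "".join(c for c in start_datetime if c.isdigit()).ljust(14, "0")[:14]
    if end_datetime:
        params["enddatetime"] = "".join(c for c in end_datetime if c.isdigit()).ljust(14, "0")[:14]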
|
||||
if source_country:
|
||||
params["sourcecountry"] = source_country
|
||||
if source_lang:
|
||||
params["sourcelang"] = source_lang
|
||||
|
||||
url = f"{BASE}?{urllib.parse.urlencode(params)}"
|
||||
payload: dict | list = {}
|
||||
for attempt in range(3):
|
||||
try:
|
||||
payload = get_json(url)
|
||||
break
|
||||
except RuntimeError as e:
|
||||
# GDELT requires 1 request per 5 seconds; back off and retry.
|
||||
if "429" in str(e) and attempt < 2:
|
||||
print(
|
||||
f"GDELT throttle hit; sleeping 6s before retry "
|
||||
f"(attempt {attempt + 1}/3)",
|
||||
file=sys.stderr,
|
||||
)
|
||||
time.sleep(6)
|
||||
continue
|
||||
print(f"GDELT error: {e}", file=sys.stderr)
|
||||
payload = {}
|
||||
break
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"GDELT error: {e}", file=sys.stderr)
|
||||
payload = {}
|
||||
break
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
if isinstance(payload, dict):
|
||||
articles = payload.get("articles", []) or []
|
||||
for a in articles[:limit]:
|
||||
seen = (a.get("seendate") or "")
|
||||
# GDELT format: 20260319T083000Z → 2026-03-19 08:30:00Z
|
||||
if len(seen) == 16 and "T" in seen:
|
||||
seen = f"{seen[0:4]}-{seen[4:6]}-{seen[6:8]} {seen[9:11]}:{seen[11:13]}:{seen[13:15]}Z"
|
||||
rows.append(
|
||||
{
|
||||
"title": (a.get("title") or "").replace("\n", " ").strip(),
|
||||
"url": a.get("url") or "",
|
||||
"seen_date": seen,
|
||||
"domain": a.get("domain") or "",
|
||||
"language": a.get("language") or "",
|
||||
"source_country": a.get("sourcecountry") or "",
|
||||
"tone": str(a.get("tone") or ""),
|
||||
"social_image": a.get("socialimage") or "",
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
if not rows:
|
||||
print(
|
||||
f"GDELT: 0 articles for query={query!r}. "
|
||||
"GDELT indexes ~2015→present. Try widening the timespan or "
|
||||
"checking the query syntax (https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--query", required=True, help='Search query (supports GDELT operators: quoted phrases, AND/OR/NOT, sourcecountry:, theme:)')
|
||||
p.add_argument(
|
||||
"--mode",
|
||||
default="ArtList",
|
||||
choices=["ArtList", "ImageCollage", "TimelineVol", "TimelineTone", "ToneChart"],
|
||||
help="GDELT mode (default ArtList for article list)",
|
||||
)
|
||||
p.add_argument(
|
||||
"--timespan",
|
||||
help="Relative window: e.g. '1d', '1w', '1m', '3m', '1y' (overrides start/end)",
|
||||
)
|
||||
p.add_argument("--start", help="Absolute start YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS")
|
||||
p.add_argument("--end", help="Absolute end YYYY-MM-DD or YYYY-MM-DDTHH:MM:SS")
|
||||
p.add_argument("--source-country", help="2-letter source country (e.g. US, UK)")
|
||||
p.add_argument("--source-lang", help="Source language (e.g. English, Spanish)")
|
||||
p.add_argument("--limit", type=int, default=100)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(
|
||||
query=a.query,
|
||||
mode=a.mode,
|
||||
timespan=a.timespan,
|
||||
start_datetime=a.start,
|
||||
end_datetime=a.end,
|
||||
source_country=a.source_country,
|
||||
source_lang=a.source_lang,
|
||||
limit=a.limit,
|
||||
out_path=a.out,
|
||||
)
|
||||
print(f"Wrote {n} GDELT article rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
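
The GDELT script converts dates in two directions: user-supplied `--start`/`--end` values become compact request parameters, and the `seendate` field of each article is expanded into a readable timestamp. A hedged sketch of both conversions using `datetime` follows. It assumes the API takes `YYYYMMDDHHMMSS`; the helper names `to_gdelt_datetime` and `format_seendate` are hypothetical and do not appear in the script.

```python
from datetime import datetime

# Input layouts accepted by the hypothetical helper below.
_INPUT_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def to_gdelt_datetime(value: str) -> str:
    """Convert an ISO-style date or timestamp to YYYYMMDDHHMMSS."""
    for fmt in _INPUT_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d%H%M%S")
        except ValueError:
            continue
    raise ValueError(f"unsupported date format: {value!r}")


def format_seendate(seen: str) -> str:
    """Turn 20260319T083000Z into 2026-03-19 08:30:00Z; pass other values through."""
    try:
        parsed = datetime.strptime(seen, "%Y%m%dT%H%M%SZ")
    except ValueError:
        return seen
    return parsed.strftime("%Y-%m-%d %H:%M:%SZ")
```

Parsing with explicit formats rejects malformed input early, whereas plain string replacement would forward it to the API unchanged.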
|
||||
|
|
@@ -0,0 +1,234 @@
|
|||
#!/usr/bin/env python3
|
"""Search ICIJ Offshore Leaks via the bulk CSV database.

The old reconcile endpoint (https://offshoreleaks.icij.org/reconcile) returns
404; ICIJ has removed it. The remaining stable access path is the public
bulk download:

    https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip

~70 MB, ~6 CSVs inside (nodes-entities, nodes-officers, nodes-intermediaries,
nodes-addresses, relationships, ...). We cache it under
$HERMES_OSINT_CACHE/icij/ (default: ~/.cache/hermes-osint/icij/) and search
locally so the agent doesn't re-download for every query.

Output CSV columns match the original `fetch_icij_offshore.py` contract.
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import io
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import urllib.request
|
||||
import zipfile
|
||||
from pathlib import Path
|
||||
|
||||
BULK_URL = "https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip"
|
||||
|
||||
COLUMNS = [
|
||||
"node_id",
|
||||
"name",
|
||||
"node_type",
|
||||
"country_codes",
|
||||
"countries",
|
||||
"jurisdiction",
|
||||
"incorporation_date",
|
||||
"inactivation_date",
|
||||
"source",
|
||||
"entity_url",
|
||||
"connections",
|
||||
]
|
||||
|
||||
|
||||
def _cache_dir() -> Path:
|
||||
base = os.environ.get("HERMES_OSINT_CACHE")
|
||||
if base:
|
||||
return Path(base) / "icij"
|
||||
return Path.home() / ".cache" / "hermes-osint" / "icij"
|
||||
|
||||
|
||||
def _download(dest: Path, force: bool = False) -> Path:
|
||||
"""Download (or reuse cached) ICIJ bulk ZIP."""
|
||||
dest.mkdir(parents=True, exist_ok=True)
|
||||
zip_path = dest / "full-oldb.zip"
|
||||
if zip_path.exists() and not force:
|
||||
# Re-check age: refetch if older than 30 days.
|
||||
age_days = (time.time() - zip_path.stat().st_mtime) / 86400
|
||||
if age_days < 30:
|
||||
return zip_path
|
||||
print(f"Downloading ICIJ bulk database (~70 MB) to {zip_path}", file=sys.stderr)
|
||||
req = urllib.request.Request(
|
||||
BULK_URL,
|
||||
headers={"User-Agent": "hermes-agent osint-investigation skill"},
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=120) as resp: # noqa: S310
|
||||
tmp = zip_path.with_suffix(".zip.tmp")
|
||||
with open(tmp, "wb") as fh:
|
||||
while True:
|
||||
chunk = resp.read(1 << 16)
|
||||
if not chunk:
|
||||
break
|
||||
fh.write(chunk)
|
||||
tmp.replace(zip_path)
|
||||
return zip_path
|
||||
|
||||
|
||||
def _open_csv(zf: zipfile.ZipFile, name_pattern: str):
|
||||
"""Open the first CSV matching name_pattern (case-insensitive substring)."""
|
||||
for info in zf.infolist():
|
||||
if name_pattern.lower() in info.filename.lower() and info.filename.lower().endswith(".csv"):
|
||||
return zf.open(info), info.filename
|
||||
return None, None
|
||||
|
||||
|
||||
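def _match(needle_norm: str, hay: str) -> bool:
    # Normalize the haystack the same way as the needle so punctuation in
    # stored names ("SMITH, JOHN") cannot hide a match for "SMITH JOHN".
    return needle_norm in _normalize_query(hay or "")


def _normalize_query(s: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    s = s.upper()
    s = re.sub(r"[^\w\s]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s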
|
||||
|
||||
|
||||
def fetch(
|
||||
entity: str | None,
|
||||
officer: str | None,
|
||||
jurisdiction: str | None,
|
||||
out_path: str,
|
||||
cache_dir: Path,
|
||||
force_refresh: bool = False,
|
||||
limit: int = 500,
|
||||
) -> int:
|
||||
zip_path = _download(cache_dir, force=force_refresh)
|
||||
rows: list[dict[str, str]] = []
|
||||
needles: list[tuple[str, str]] = [] # (kind, normalized needle)
|
||||
if entity:
|
||||
needles.append(("Entity", _normalize_query(entity)))
|
||||
if officer:
|
||||
needles.append(("Officer", _normalize_query(officer)))
|
||||
jur_norm = _normalize_query(jurisdiction) if jurisdiction else None
|
||||
|
||||
targets = [
|
||||
("Entity", "nodes-entities"),
|
||||
("Officer", "nodes-officers"),
|
||||
("Intermediary", "nodes-intermediaries"),
|
||||
]
|
||||
|
||||
with zipfile.ZipFile(zip_path) as zf:
|
||||
for node_type, csv_substring in targets:
|
||||
            # All name needles regardless of kind; used as the fallback below.
            relevant_needles = [n for (_, n) in needles]
            # Skip this CSV only when name needles exist, none of them targets
            # this node type, and no jurisdiction filter is set. With a
            # jurisdiction filter, every needle is tried against every node type.
            applicable_needles = [n for (k, n) in needles if k == node_type]
            if needles and not applicable_needles and not jur_norm:
                continue
|
||||
stream, fname = _open_csv(zf, csv_substring)
|
||||
if not stream:
|
||||
continue
|
||||
with stream:
|
||||
text = io.TextIOWrapper(stream, encoding="utf-8", errors="replace")
|
||||
reader = csv.DictReader(text)
|
||||
for row in reader:
|
||||
name = (row.get("name") or "").strip()
|
||||
if not name:
|
||||
continue
|
||||
name_u = name.upper()
|
||||
matched = False
|
||||
for n in applicable_needles or relevant_needles:
|
||||
if _match(n, name_u):
|
||||
matched = True
|
||||
break
|
||||
if not needles:
|
||||
matched = True # jurisdiction-only sweep
|
||||
if not matched:
|
||||
continue
|
||||
jur = (row.get("jurisdiction_description") or row.get("country_codes") or "").strip()
|
||||
if jur_norm and jur_norm not in jur.upper() and jur_norm not in (row.get("countries") or "").upper():
|
||||
continue
|
||||
node_id = (row.get("node_id") or "").strip()
|
||||
rows.append(
|
||||
{
|
||||
"node_id": node_id,
|
||||
"name": name,
|
||||
"node_type": node_type,
|
||||
"country_codes": row.get("country_codes", "") or "",
|
||||
"countries": row.get("countries", "") or "",
|
||||
"jurisdiction": jur,
|
||||
"incorporation_date": row.get("incorporation_date", "") or "",
|
||||
"inactivation_date": row.get("inactivation_date", "") or "",
|
||||
"source": row.get("sourceID", "") or row.get("source", "") or "",
|
||||
"entity_url": (
|
||||
f"https://offshoreleaks.icij.org/nodes/{node_id}" if node_id else ""
|
||||
),
|
||||
"connections": "",
|
||||
}
|
||||
)
|
||||
if len(rows) >= limit:
|
||||
break
|
||||
if len(rows) >= limit:
|
||||
break
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
if not rows:
|
||||
bits = []
|
||||
if entity:
|
||||
bits.append(f"entity={entity!r}")
|
||||
if officer:
|
||||
bits.append(f"officer={officer!r}")
|
||||
if jurisdiction:
|
||||
bits.append(f"jurisdiction={jurisdiction!r}")
|
||||
print(
|
||||
f"ICIJ: 0 matches for {', '.join(bits)}. "
|
||||
"The bulk database covers offshore leaks (Panama, Paradise, Pandora, "
|
||||
"Bahamas, Offshore Leaks). Most private US individuals are NOT in it.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--entity", help="Search by entity name (substring, case-insensitive)")
|
||||
p.add_argument("--officer", help="Search by officer / individual name (substring, case-insensitive)")
|
||||
p.add_argument("--jurisdiction", help="Filter results by jurisdiction substring")
|
||||
p.add_argument("--limit", type=int, default=500)
|
||||
p.add_argument("--out", required=True)
|
||||
p.add_argument(
|
||||
"--cache-dir",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="Override cache directory (default: $HERMES_OSINT_CACHE/icij or ~/.cache/hermes-osint/icij)",
|
||||
)
|
||||
p.add_argument(
|
||||
"--force-refresh",
|
||||
action="store_true",
|
||||
help="Re-download the bulk ZIP even if a recent cached copy exists.",
|
||||
)
|
||||
a = p.parse_args()
|
||||
if not (a.entity or a.officer or a.jurisdiction):
|
||||
p.error("must supply at least one of --entity / --officer / --jurisdiction")
|
||||
n = fetch(
|
||||
entity=a.entity,
|
||||
officer=a.officer,
|
||||
jurisdiction=a.jurisdiction,
|
||||
out_path=a.out,
|
||||
cache_dir=a.cache_dir or _cache_dir(),
|
||||
force_refresh=a.force_refresh,
|
||||
limit=a.limit,
|
||||
)
|
||||
print(f"Wrote {n} ICIJ Offshore Leaks rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
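
The ICIJ script searches a CSV member directly inside the cached ZIP instead of extracting it to disk. The sketch below shows that streaming pattern on a tiny archive built in memory. The helper names (`normalize`, `search_zip_csv`, `demo_zip`) and the sample rows are made up for illustration, and the sketch normalizes both the needle and each stored name, which is a design choice of the example.

```python
import csv
import io
import re
import zipfile


def normalize(s: str) -> str:
    """Uppercase, turn punctuation into spaces, collapse whitespace."""
    s = re.sub(r"[^\w\s]", " ", s.upper())
    return re.sub(r"\s+", " ", s).strip()


def search_zip_csv(zip_bytes: bytes, member_substring: str, needle: str, limit: int = 10) -> list[dict]:
    """Return rows whose `name` contains `needle`, reading the CSV straight from the ZIP."""
    wanted = normalize(needle)
    hits: list[dict] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            fname = info.filename.lower()
            if member_substring in fname and fname.endswith(".csv"):
                with zf.open(info) as raw:
                    # Wrap the binary member so csv can decode it row by row.
                    text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")
                    for row in csv.DictReader(text):
                        if wanted in normalize(row.get("name") or ""):
                            hits.append(row)
                            if len(hits) >= limit:
                                break
                break  # only the first matching member is scanned
    return hits


def demo_zip() -> bytes:
    """Build a one-file archive with made-up rows (not real ICIJ data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("nodes-entities.csv", 'node_id,name\n1,"Acme Holdings, Ltd."\n2,Other Co\n')
    return buf.getvalue()


hits = search_zip_csv(demo_zip(), "nodes-entities", "acme holdings ltd")
```

Because rows are decoded lazily, memory use stays roughly constant regardless of the size of the CSV inside the archive.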
|
||||
|
|
@@ -0,0 +1,203 @@
|
|||
#!/usr/bin/env python3
|
"""Search NYC property records via ACRIS (Automated City Register Information System).

Uses the city's Socrata-backed open data API. No auth required for read access.

Datasets:
  bnx9-e6tj: Real Property Master (one row per recorded document)
  636b-3b5g: Real Property Parties (names: grantor, grantee, etc.)
  8h5j-fqxa: Real Property Legal (lot / property identifiers)
  uqqa-hym2: Real Property References

The Parties dataset has the names. We search it by name or address and,
unless --no-enrich is given, join to Master to get the doc type, dates,
borough, and amount.
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import sys
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
PARTIES_URL = "https://data.cityofnewyork.us/resource/636b-3b5g.json"
|
||||
MASTER_URL = "https://data.cityofnewyork.us/resource/bnx9-e6tj.json"
|
||||
|
||||
PARTY_TYPE = {
|
||||
"1": "grantor (seller / mortgagor / debtor)",
|
||||
"2": "grantee (buyer / mortgagee / creditor)",
|
||||
"3": "other party",
|
||||
}
|
||||
|
||||
BOROUGH = {
|
||||
"1": "Manhattan",
|
||||
"2": "Bronx",
|
||||
"3": "Brooklyn",
|
||||
"4": "Queens",
|
||||
"5": "Staten Island",
|
||||
}
|
||||
|
||||
COLUMNS = [
|
||||
"document_id",
|
||||
"name",
|
||||
"party_type",
|
||||
"party_role",
|
||||
"address_1",
|
||||
"address_2",
|
||||
"city",
|
||||
"state",
|
||||
"zip",
|
||||
"country",
|
||||
"doc_type",
|
||||
"doc_date",
|
||||
"recorded_date",
|
||||
"borough",
|
||||
"amount",
|
||||
"filing_url",
|
||||
]
|
||||
|
||||
|
||||
def _filing_url(document_id: str) -> str:
|
||||
if not document_id:
|
||||
return ""
|
||||
return (
|
||||
f"https://a836-acris.nyc.gov/DS/DocumentSearch/DocumentImageView?doc_id={document_id}"
|
||||
)
|
||||
|
||||
|
||||
def fetch(
|
||||
name: str | None,
|
||||
address: str | None,
|
||||
party_type: str | None,
|
||||
limit: int,
|
||||
out_path: str,
|
||||
enrich: bool = True,
|
||||
) -> int:
|
||||
if not (name or address):
|
||||
raise SystemExit("must supply --name or --address")
|
||||
|
||||
where_clauses: list[str] = []
|
||||
if name:
|
||||
safe = name.upper().replace("'", "''")
|
||||
where_clauses.append(f"upper(name) like '%{safe}%'")
|
||||
if address:
|
||||
safe_addr = address.upper().replace("'", "''")
|
||||
where_clauses.append(f"upper(address_1) like '%{safe_addr}%'")
|
||||
if party_type and party_type in {"1", "2", "3"}:
|
||||
where_clauses.append(f"party_type='{party_type}'")
|
||||
|
||||
params = {
|
||||
"$where": " AND ".join(where_clauses),
|
||||
"$limit": str(limit),
|
||||
}
|
||||
url = f"{PARTIES_URL}?{urllib.parse.urlencode(params)}"
|
||||
parties = get_json(url)
|
||||
if not isinstance(parties, list):
|
||||
raise SystemExit(f"Unexpected ACRIS response: {parties!r}")
|
||||
|
||||
# Enrich with master record (doc_type, dates, borough, amount).
|
||||
doc_ids: list[str] = sorted({
|
||||
d for d in (p.get("document_id") for p in parties) if d
|
||||
})
|
||||
masters: dict[str, dict] = {}
|
||||
if enrich and doc_ids:
|
||||
# Batch up to 100 doc_ids per request (Socrata IN-list is fine for this).
|
||||
for i in range(0, len(doc_ids), 100):
|
||||
chunk = doc_ids[i : i + 100]
|
||||
id_list = ",".join(f"'{d}'" for d in chunk)
|
||||
master_params = {
|
||||
"$where": f"document_id in ({id_list})",
|
||||
"$limit": "100",
|
||||
}
|
||||
url = f"{MASTER_URL}?{urllib.parse.urlencode(master_params)}"
|
||||
try:
|
||||
rows = get_json(url)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"ACRIS master lookup failed for chunk: {e}", file=sys.stderr)
|
||||
continue
|
||||
if isinstance(rows, list):
|
||||
for r in rows:
|
||||
did = r.get("document_id", "")
|
||||
if did:
|
||||
masters[did] = r
|
||||
|
||||
out_rows: list[dict[str, str]] = []
|
||||
for p in parties:
|
||||
did = p.get("document_id", "") or ""
|
||||
m = masters.get(did, {})
|
||||
out_rows.append(
|
||||
{
|
||||
"document_id": did,
|
||||
"name": p.get("name", "") or "",
|
||||
"party_type": p.get("party_type", "") or "",
|
||||
"party_role": PARTY_TYPE.get(p.get("party_type", ""), ""),
|
||||
"address_1": p.get("address_1", "") or "",
|
||||
"address_2": p.get("address_2", "") or "",
|
||||
"city": p.get("city", "") or "",
|
||||
"state": p.get("state", "") or "",
|
||||
"zip": p.get("zip", "") or "",
|
||||
"country": p.get("country", "") or "",
|
||||
"doc_type": m.get("doc_type", "") or "",
|
||||
"doc_date": (m.get("document_date", "") or "")[:10],
|
||||
"recorded_date": (m.get("recorded_datetime", "") or "")[:10],
|
||||
"borough": BOROUGH.get(m.get("recorded_borough", ""), m.get("recorded_borough", "")),
|
||||
"amount": m.get("document_amt", "") or "",
|
||||
"filing_url": _filing_url(did),
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(out_rows)
|
||||
|
||||
if not out_rows:
|
||||
filters = []
|
||||
if name:
|
||||
filters.append(f"name={name!r}")
|
||||
if address:
|
||||
filters.append(f"address={address!r}")
|
||||
print(
|
||||
f"NYC ACRIS: 0 records for {', '.join(filters)}. "
|
||||
"ACRIS covers ONLY NYC (5 boroughs). For property records elsewhere, "
|
||||
"search the relevant county recorder directly.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(out_rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--name", help="Party name substring (case-insensitive)")
|
||||
p.add_argument("--address", help="Address line 1 substring")
|
||||
p.add_argument(
|
||||
"--party-type",
|
||||
choices=["1", "2", "3"],
|
||||
help="Filter party type: 1=grantor (seller/mortgagor), 2=grantee (buyer/mortgagee), 3=other",
|
||||
)
|
||||
p.add_argument("--limit", type=int, default=200)
|
||||
p.add_argument(
|
||||
"--no-enrich",
|
||||
action="store_true",
|
||||
help="Skip the master-document lookup that adds doc_type/date/amount",
|
||||
)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(
|
||||
name=a.name,
|
||||
address=a.address,
|
||||
party_type=a.party_type,
|
||||
limit=a.limit,
|
||||
out_path=a.out,
|
||||
enrich=not a.no_enrich,
|
||||
)
|
||||
print(f"Wrote {n} NYC ACRIS rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
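
The ACRIS script builds Socrata `$where` clauses by hand: a case-insensitive `like` filter with single quotes doubled, and an `in (...)` list sent in batches of 100 document IDs. A small sketch of those building blocks follows. The helper names (`soql_quote`, `like_clause`, `in_clauses`, `build_url`) are hypothetical, and the sketch does not escape the `%` and `_` wildcards inside the search term.

```python
import urllib.parse
from typing import Iterator


def soql_quote(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def like_clause(column: str, needle: str) -> str:
    """Case-insensitive substring filter on `column`."""
    return f"upper({column}) like {soql_quote('%' + needle.upper() + '%')}"


def in_clauses(column: str, values: list[str], size: int = 100) -> Iterator[str]:
    """Yield one `column in (...)` clause per batch of at most `size` values."""
    for i in range(0, len(values), size):
        chunk = values[i : i + size]
        yield f"{column} in ({','.join(soql_quote(v) for v in chunk)})"


def build_url(base: str, where: str, limit: int) -> str:
    """Attach the $where and $limit parameters, URL-encoded, to `base`."""
    return f"{base}?{urllib.parse.urlencode({'$where': where, '$limit': str(limit)})}"


url = build_url(
    "https://data.cityofnewyork.us/resource/636b-3b5g.json",
    like_clause("name", "O'Brien"),
    limit=5,
)
```

Batching keeps each request URL short; the batch size of 100 mirrors the script and is not a documented Socrata limit.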
|
||||
|
|
@@ -0,0 +1,175 @@
|
|||
#!/usr/bin/env python3
|
"""Fetch OFAC SDN list (CSV format) and normalize.

Public endpoint: https://www.treasury.gov/ofac/downloads/sdn.csv
Format reference: https://ofac.treasury.gov/specially-designated-nationals-and-blocked-persons-list-sdn-human-readable-lists

The SDN CSV uses a specific 12-column format with no header row:
    ent_num, sdn_name, sdn_type, program, title, call_sign, vess_type,
    tonnage, grt, vess_flag, vess_owner, remarks
Address and AKA records live in separate files. We fetch all three and join.
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import io
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get # noqa: E402
|
||||
|
||||
SDN_URL = "https://www.treasury.gov/ofac/downloads/sdn.csv"
|
||||
ADD_URL = "https://www.treasury.gov/ofac/downloads/add.csv"
|
||||
ALT_URL = "https://www.treasury.gov/ofac/downloads/alt.csv"
|
||||
|
||||
SDN_COLS = [
|
||||
"ent_num", "sdn_name", "sdn_type", "program", "title",
|
||||
"call_sign", "vess_type", "tonnage", "grt", "vess_flag",
|
||||
"vess_owner", "remarks",
|
||||
]
|
||||
ADD_COLS = [
|
||||
"ent_num", "add_num", "address", "city_state_zip", "country", "add_remarks",
|
||||
]
|
||||
ALT_COLS = [
|
||||
"ent_num", "alt_num", "alt_type", "alt_name", "alt_remarks",
|
||||
]
|
||||
|
||||
COLUMNS = [
|
||||
"entity_id",
|
||||
"name",
|
||||
"entity_type",
|
||||
"program_list",
|
||||
"title",
|
||||
"nationalities",
|
||||
"aka_list",
|
||||
"addresses",
|
||||
"dob",
|
||||
"pob",
|
||||
"remarks",
|
||||
"last_updated",
|
||||
]
|
||||
|
||||
_TYPE_MAP = {
|
||||
"individual": "individual",
|
||||
"entity": "entity",
|
||||
"vessel": "vessel",
|
||||
"aircraft": "aircraft",
|
||||
}
|
||||
|
||||
|
||||
def _read_csv(url: str, columns: list[str]) -> list[dict[str, str]]:
|
||||
body = get(url, timeout=60).decode("latin-1", errors="replace")
|
||||
reader = csv.reader(io.StringIO(body))
|
||||
out = []
|
||||
for row in reader:
|
||||
if not row:
|
||||
continue
|
||||
# Pad/truncate to expected width.
|
||||
row = row[: len(columns)] + [""] * (len(columns) - len(row))
|
||||
out.append(dict(zip(columns, row)))
|
||||
return out
|
||||
|
||||
|
||||
def _strip_quotes(s: str) -> str:
|
||||
s = s.strip()
|
||||
if s.startswith('"') and s.endswith('"'):
|
||||
s = s[1:-1]
|
||||
if s == "-0-":
|
||||
return ""
|
||||
return s
|
||||
|
||||
|
||||
def fetch(
|
||||
program: str | None,
|
||||
entity_type: str | None,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
sdn = _read_csv(SDN_URL, SDN_COLS)
|
||||
addresses = _read_csv(ADD_URL, ADD_COLS)
|
||||
akas = _read_csv(ALT_URL, ALT_COLS)
|
||||
|
||||
addr_by_ent: dict[str, list[str]] = defaultdict(list)
|
||||
for a in addresses:
|
||||
ent = _strip_quotes(a["ent_num"])
|
||||
parts = [
|
||||
_strip_quotes(a[c])
|
||||
for c in ("address", "city_state_zip", "country")
|
||||
if _strip_quotes(a[c])
|
||||
]
|
||||
if parts:
|
||||
addr_by_ent[ent].append(", ".join(parts))
|
||||
|
||||
aka_by_ent: dict[str, list[str]] = defaultdict(list)
|
||||
for k in akas:
|
||||
ent = _strip_quotes(k["ent_num"])
|
||||
name = _strip_quotes(k["alt_name"])
|
||||
if name:
|
||||
aka_by_ent[ent].append(name)
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
for r in sdn:
|
||||
ent_num = _strip_quotes(r["ent_num"])
|
||||
if not ent_num:
|
||||
continue
|
||||
sdn_type = _TYPE_MAP.get(_strip_quotes(r["sdn_type"]).lower(), _strip_quotes(r["sdn_type"]))
|
||||
if entity_type and sdn_type != entity_type:
|
||||
continue
|
||||
        progs = _strip_quotes(r["program"])
        # Compare against stripped program codes so "SDGT; IRAN" matches --program IRAN.
        if program and program.strip().upper() not in {p.strip() for p in progs.upper().split(";")}:
            continue
|
||||
remarks = _strip_quotes(r["remarks"])
|
||||
# DOB / POB are commonly embedded in remarks for individuals.
|
||||
dob = ""
|
||||
pob = ""
|
||||
if sdn_type == "individual" and remarks:
|
||||
for chunk in remarks.split(";"):
|
||||
ch = chunk.strip()
|
||||
if ch.upper().startswith("DOB"):
|
||||
dob = ch.split(maxsplit=1)[1] if " " in ch else ""
|
||||
elif ch.upper().startswith("POB"):
|
||||
pob = ch.split(maxsplit=1)[1] if " " in ch else ""
|
||||
rows.append(
|
||||
{
|
||||
"entity_id": ent_num,
|
||||
"name": _strip_quotes(r["sdn_name"]),
|
||||
"entity_type": sdn_type,
|
||||
"program_list": "; ".join(p.strip() for p in progs.split(";") if p.strip()),
|
||||
"title": _strip_quotes(r["title"]),
|
||||
"nationalities": "", # not in this CSV; available in XML format
|
||||
"aka_list": "; ".join(aka_by_ent.get(ent_num, [])),
|
||||
"addresses": "; ".join(addr_by_ent.get(ent_num, [])),
|
||||
"dob": dob,
|
||||
"pob": pob,
|
||||
"remarks": remarks,
|
||||
"last_updated": "",
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--program", help="Filter to specific sanctions program (e.g. SDGT, IRAN)")
|
||||
p.add_argument(
|
||||
"--entity-type",
|
||||
choices=["individual", "entity", "vessel", "aircraft"],
|
||||
help="Filter to a specific entity type",
|
||||
)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(program=a.program, entity_type=a.entity_type, out_path=a.out)
|
||||
print(f"Wrote {n} OFAC SDN rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
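
The OFAC script reads header-less CSV files by assigning column names positionally, treats `-0-` as an empty value, and then joins the alias file onto the main list by `ent_num`. The sketch below reproduces that flow on a few made-up rows. The helper names `read_headerless` and `attach_akas` and the sample data are illustrative assumptions, not real SDN entries.

```python
import csv
import io
from collections import defaultdict

NULL_MARKER = "-0-"  # placeholder the script's _strip_quotes maps to ""


def read_headerless(text: str, columns: list[str]) -> list[dict[str, str]]:
    """Parse CSV text that has no header row into dicts keyed by `columns`."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        if not raw:
            continue
        # Drop extra cells and pad short rows so every dict has the same keys.
        raw = raw[: len(columns)] + [""] * (len(columns) - len(raw))
        cleaned = ["" if cell.strip() == NULL_MARKER else cell.strip() for cell in raw]
        rows.append(dict(zip(columns, cleaned)))
    return rows


def attach_akas(sdn_rows: list[dict[str, str]], alt_rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return copies of sdn_rows with an `aka_list` column joined on ent_num."""
    akas: dict[str, list[str]] = defaultdict(list)
    for alt in alt_rows:
        if alt["alt_name"]:
            akas[alt["ent_num"]].append(alt["alt_name"])
    return [{**row, "aka_list": "; ".join(akas.get(row["ent_num"], []))} for row in sdn_rows]


# Made-up rows for illustration only.
SAMPLE_SDN = '101,"DOE, John",individual,SDGT\n102,"EXAMPLE SHIPPING CO",-0- ,IRAN\n'
SAMPLE_ALT = '101,1,aka,"DOE, Johnny",-0- \n101,2,aka,"J. DOE",-0- \n'

sdn = read_headerless(SAMPLE_SDN, ["ent_num", "sdn_name", "sdn_type", "program", "title"])
alt = read_headerless(SAMPLE_ALT, ["ent_num", "alt_num", "alt_type", "alt_name", "alt_remarks"])
joined = attach_akas(sdn, alt)
```

Grouping aliases in a dictionary first makes the join a single pass over each file instead of a nested loop.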
|
||||
|
|
@@ -0,0 +1,192 @@
|
|||
#!/usr/bin/env python3
|
"""Search OpenCorporates company registry data.

OpenCorporates aggregates ~200M companies from 130+ jurisdictions. The
public API requires an API token (free tier: 500 calls/month). Set
OPENCORPORATES_API_TOKEN in env or pass --token.

Without a token, this script falls back to scraping the public HTML
search page (limited fields, more brittle, no jurisdiction filter).
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get, get_json # noqa: E402
|
||||
|
||||
API_URL = "https://api.opencorporates.com/v0.4/companies/search"
|
||||
HTML_URL = "https://opencorporates.com/companies"
|
||||
|
||||
COLUMNS = [
|
||||
"name",
|
||||
"company_number",
|
||||
"jurisdiction_code",
|
||||
"jurisdiction_name",
|
||||
"incorporation_date",
|
||||
"dissolution_date",
|
||||
"company_type",
|
||||
"status",
|
||||
"registered_address",
|
||||
"opencorporates_url",
|
||||
"officers_count",
|
||||
"source",
|
||||
]
|
||||
|
||||
|
||||
def _via_api(query: str, jurisdiction: str | None, token: str, limit: int) -> list[dict]:
|
||||
params = {
|
||||
"q": query,
|
||||
"api_token": token,
|
||||
"per_page": str(min(limit, 100)),
|
||||
}
|
||||
if jurisdiction:
|
||||
params["jurisdiction_code"] = jurisdiction
|
||||
url = f"{API_URL}?{urllib.parse.urlencode(params)}"
|
||||
payload = get_json(url)
|
||||
if not isinstance(payload, dict):
|
||||
return []
|
||||
results = payload.get("results", {}).get("companies", []) or []
|
||||
return [r.get("company", {}) for r in results if isinstance(r, dict)]
|
||||
|
||||
|
||||
def _via_html(query: str, limit: int) -> list[dict]:
|
||||
"""Best-effort HTML fallback when no API token is available."""
|
||||
params = {"q": query, "utf8": "✓"}
|
||||
url = f"{HTML_URL}?{urllib.parse.urlencode(params)}"
|
||||
body = get(url, user_agent="Mozilla/5.0 hermes-osint").decode("utf-8", errors="replace")
|
||||
# Each result is in <li class="company"> ... </li> with name, url, status
|
||||
pattern = re.compile(
|
||||
r'<li[^>]*class="[^"]*company[^"]*"[^>]*>.*?'
|
||||
r'<a[^>]+href="(?P<url>/companies/[^"]+)"[^>]*>(?P<name>[^<]+)</a>'
|
||||
r'(?:.*?<span[^>]*class="[^"]*jurisdiction[^"]*"[^>]*>(?P<jur>[^<]+)</span>)?'
|
||||
r"(?:.*?<dt[^>]*>(?:Company\s+Number|Number)</dt>\s*<dd[^>]*>(?P<num>[^<]+)</dd>)?",
|
||||
re.DOTALL | re.IGNORECASE,
|
||||
)
|
||||
out = []
|
||||
for m in pattern.finditer(body):
|
||||
if len(out) >= limit:
|
||||
break
|
||||
url_path = m.group("url").strip()
|
||||
out.append(
|
||||
{
|
||||
"name": (m.group("name") or "").strip(),
|
||||
"opencorporates_url": f"https://opencorporates.com{url_path}",
|
||||
"jurisdiction_code": (m.group("jur") or "").strip(),
|
||||
"company_number": (m.group("num") or "").strip(),
|
||||
"_via": "html",
|
||||
}
|
||||
)
|
||||
return out
|
||||
|
||||
|
||||
def fetch(
|
||||
query: str,
|
||||
jurisdiction: str | None,
|
||||
token: str | None,
|
||||
limit: int,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
if token:
|
||||
try:
|
||||
companies = _via_api(query, jurisdiction, token, limit)
|
||||
source_tag = "api"
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(
|
||||
f"OpenCorporates API call failed ({e}); falling back to HTML.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
companies = _via_html(query, limit)
|
||||
source_tag = "html-fallback"
|
||||
else:
|
||||
print(
|
||||
"OPENCORPORATES_API_TOKEN not set — using HTML fallback (limited fields). "
|
||||
"Get a free token at https://opencorporates.com/api_accounts/new",
|
||||
file=sys.stderr,
|
||||
)
|
||||
companies = _via_html(query, limit)
|
||||
source_tag = "html"
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
for c in companies[:limit]:
|
||||
if c.get("_via") == "html":
|
||||
rows.append(
|
||||
{
|
||||
"name": c.get("name", ""),
|
||||
"company_number": c.get("company_number", ""),
|
||||
"jurisdiction_code": c.get("jurisdiction_code", ""),
|
||||
"jurisdiction_name": "",
|
||||
"incorporation_date": "",
|
||||
"dissolution_date": "",
|
||||
"company_type": "",
|
||||
"status": "",
|
||||
"registered_address": "",
|
||||
"opencorporates_url": c.get("opencorporates_url", ""),
|
||||
"officers_count": "",
|
||||
"source": source_tag,
|
||||
}
|
||||
)
|
||||
continue
|
||||
addr = c.get("registered_address_in_full") or ""
|
||||
rows.append(
|
||||
{
|
||||
"name": c.get("name", "") or "",
|
||||
"company_number": c.get("company_number", "") or "",
|
||||
"jurisdiction_code": c.get("jurisdiction_code", "") or "",
|
||||
"jurisdiction_name": "",
|
||||
"incorporation_date": c.get("incorporation_date", "") or "",
|
||||
"dissolution_date": c.get("dissolution_date", "") or "",
|
||||
"company_type": c.get("company_type", "") or "",
|
||||
"status": c.get("current_status", "") or c.get("inactive", "") or "",
|
||||
"registered_address": addr,
|
||||
"opencorporates_url": c.get("opencorporates_url", "") or "",
|
||||
"officers_count": str(c.get("officers", {}).get("total_count", "") if c.get("officers") else ""),
|
||||
"source": source_tag,
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
if not rows:
|
||||
print(
|
||||
f"OpenCorporates: 0 matches for query={query!r}"
|
||||
f"{f' jurisdiction={jurisdiction!r}' if jurisdiction else ''}.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--query", required=True, help="Company name search")
|
||||
p.add_argument(
|
||||
"--jurisdiction",
|
||||
help="Jurisdiction code, e.g. 'us_ny', 'us_de', 'gb', 'sg' (lowercased OpenCorporates style)",
|
||||
)
|
||||
p.add_argument("--limit", type=int, default=50)
|
||||
p.add_argument("--token", default=os.environ.get("OPENCORPORATES_API_TOKEN"))
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(
|
||||
query=a.query,
|
||||
jurisdiction=a.jurisdiction,
|
||||
token=a.token,
|
||||
limit=a.limit,
|
||||
out_path=a.out,
|
||||
)
|
||||
print(f"Wrote {n} OpenCorporates rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
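The HTML fallback above depends entirely on a regular expression matching the search-result markup. A minimal sketch of that parsing step follows, assuming a made-up HTML fixture (`SAMPLE_HTML`) rather than real OpenCorporates output; `parse_companies` is a hypothetical helper name, and the pattern is a shortened copy of the one in `_via_html`.

```python
from __future__ import annotations

import re

# Shortened copy of the fallback pattern: anchor on the <li class="...company...">
# element, capture the link target and name, and optionally the jurisdiction span.
PATTERN = re.compile(
    r'<li[^>]*class="[^"]*company[^"]*"[^>]*>.*?'
    r'<a[^>]+href="(?P<url>/companies/[^"]+)"[^>]*>(?P<name>[^<]+)</a>'
    r'(?:.*?<span[^>]*class="[^"]*jurisdiction[^"]*"[^>]*>(?P<jur>[^<]+)</span>)?',
    re.DOTALL | re.IGNORECASE,
)

# Hypothetical fixture; the real page markup may differ and can change without notice.
SAMPLE_HTML = """
<ul>
  <li class="search-result company">
    <a href="/companies/us_de/1234567">EXAMPLE HOLDINGS LLC</a>
    <span class="jurisdiction">Delaware (US)</span>
  </li>
  <li class="company">
    <a href="/companies/gb/00000001">SAMPLE TRADING LTD</a>
    <span class="jurisdiction">United Kingdom</span>
  </li>
</ul>
"""


def parse_companies(body: str, limit: int = 10) -> list[dict[str, str]]:
    """Extract up to `limit` company rows from a search-result page body."""
    rows: list[dict[str, str]] = []
    for m in PATTERN.finditer(body):
        if len(rows) >= limit:
            break
        rows.append(
            {
                "name": m.group("name").strip(),
                "opencorporates_url": "https://opencorporates.com" + m.group("url").strip(),
                # The optional group yields None when the span is missing.
                "jurisdiction": (m.group("jur") or "").strip(),
            }
        )
    return rows


if __name__ == "__main__":
    for row in parse_companies(SAMPLE_HTML):
        print(row)
```

Because the optional groups use a lazy `.*?` under `re.DOTALL`, a result that lacks a jurisdiction span could borrow one from a later list item, so rows from this path are best treated as leads to confirm through the API.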
|
||||
|
|
@@ -0,0 +1,184 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Fetch SEC EDGAR filings index for a given CIK or company name.
|
||||
|
||||
SEC requires a User-Agent header with contact info. Set SEC_USER_AGENT,
|
||||
e.g. SEC_USER_AGENT="Research example@example.com".
|
||||
|
||||
Filings JSON is published at:
|
||||
https://data.sec.gov/submissions/CIK<10-digit-padded>.json
|
||||
|
||||
Company lookup uses:
|
||||
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&company=<name>&output=atom
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get, get_json # noqa: E402
|
||||
|
||||
SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik}.json"
|
||||
COLUMNS = [
|
||||
"cik",
|
||||
"company_name",
|
||||
"form_type",
|
||||
"filing_date",
|
||||
"accession_number",
|
||||
"primary_document",
|
||||
"filing_url",
|
||||
"reporting_period",
|
||||
]
|
||||
|
||||
|
||||
def _ua() -> str:
|
||||
ua = os.environ.get("SEC_USER_AGENT", "").strip()
|
||||
if not ua:
|
||||
raise SystemExit(
|
||||
"SEC requires a User-Agent with contact info. "
|
||||
"Set SEC_USER_AGENT='Your Name your@email'."
|
||||
)
|
||||
return ua
|
||||
|
||||
|
||||
def _resolve_cik(company: str) -> tuple[str, str]:
|
||||
"""Resolve a company name to a CIK via EDGAR's atom feed.
|
||||
|
||||
Returns (cik, resolved_company_name). The match may be an individual filer
(Form 3/4/5 only) rather than a company; this function does not flag that
itself. fetch() checks the filing types afterwards and warns on stderr.
|
||||
"""
|
||||
url = "https://www.sec.gov/cgi-bin/browse-edgar"
|
||||
params = {"action": "getcompany", "company": company, "output": "atom", "owner": "include"}
|
||||
body = get(url, params=params, user_agent=_ua()).decode("utf-8", errors="replace")
|
||||
m = re.search(r"CIK=(\d{10})", body)
|
||||
if not m:
|
||||
raise SystemExit(f"Could not resolve CIK for company={company!r}")
|
||||
cik = m.group(1)
|
||||
name_m = re.search(r"<title>([^<]+)\s*\((\d{10})\)</title>", body)
|
||||
resolved = name_m.group(1).strip() if name_m else ""
|
||||
return cik, resolved
|
||||
|
||||
|
||||
def fetch(
|
||||
cik: str | None,
|
||||
company: str | None,
|
||||
types: list[str],
|
||||
since: str | None,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
resolved_name = ""
|
||||
if not cik and company:
|
||||
try:
|
||||
cik, resolved_name = _resolve_cik(company) # type: ignore[assignment]
|
||||
except SystemExit as e:
|
||||
# Write empty CSV with header so downstream tools still work,
|
||||
# and tell the user clearly.
|
||||
print(f"SEC EDGAR: {e}", file=sys.stderr)
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
csv.DictWriter(fh, fieldnames=COLUMNS).writeheader()
|
||||
return 0
|
||||
if resolved_name:
|
||||
print(
|
||||
f"Resolved company={company!r} → CIK {cik} ({resolved_name})",
|
||||
file=sys.stderr,
|
||||
)
|
||||
if not cik:
|
||||
raise SystemExit("must supply --cik or --company")
|
||||
cik = cik.zfill(10)
|
||||
url = SUBMISSIONS_URL.format(cik=cik)
|
||||
payload = get_json(url, user_agent=_ua())
|
||||
if not isinstance(payload, dict):
|
||||
raise SystemExit(f"Unexpected EDGAR response shape for CIK {cik}")
|
||||
name = payload.get("name", "")
|
||||
recent = (payload.get("filings", {}) or {}).get("recent", {}) or {}
|
||||
form = recent.get("form", [])
|
||||
date = recent.get("filingDate", [])
|
||||
accession = recent.get("accessionNumber", [])
|
||||
primary_doc = recent.get("primaryDocument", [])
|
||||
period = recent.get("reportDate", [])
|
||||
|
||||
# Histogram of available filing types — useful for surfacing why a filter
|
||||
# returned 0 (e.g. user asked for 10-K on an individual Form 4 filer).
|
||||
type_hist: dict[str, int] = {}
|
||||
for ftype in form:
|
||||
type_hist[ftype] = type_hist.get(ftype, 0) + 1
|
||||
|
||||
type_set = {t.strip().upper() for t in types} if types else None
|
||||
rows: list[dict[str, str]] = []
|
||||
for i, ftype in enumerate(form):
|
||||
if type_set and ftype.upper() not in type_set:
|
||||
continue
|
||||
fdate = date[i] if i < len(date) else ""
|
||||
if since and fdate and fdate < since:
|
||||
continue
|
||||
acc = accession[i] if i < len(accession) else ""
|
||||
pdoc = primary_doc[i] if i < len(primary_doc) else ""
|
||||
acc_nodash = acc.replace("-", "")
|
||||
filing_url = (
|
||||
f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{acc_nodash}/{pdoc}"
|
||||
if acc and pdoc
|
||||
else ""
|
||||
)
|
||||
rows.append(
|
||||
{
|
||||
"cik": cik,
|
||||
"company_name": name,
|
||||
"form_type": ftype,
|
||||
"filing_date": fdate,
|
||||
"accession_number": acc,
|
||||
"primary_document": pdoc,
|
||||
"filing_url": filing_url,
|
||||
"reporting_period": period[i] if i < len(period) else "",
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
|
||||
if not rows and type_hist:
|
||||
top = sorted(type_hist.items(), key=lambda kv: -kv[1])[:8]
|
||||
hist_str = ", ".join(f"{t}={n}" for t, n in top)
|
||||
print(
|
||||
f"Warning: SEC EDGAR CIK {cik} ({name}) has {sum(type_hist.values())} "
|
||||
f"recent filings but NONE match types={types}. "
|
||||
f"Available form types: {hist_str}.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
# Insider-filer heuristic: only Form 3/4/5 → individual person, not a company.
|
||||
company_types = {"10-K", "10-Q", "8-K", "20-F", "DEF 14A", "S-1"}
|
||||
if not (set(type_hist.keys()) & company_types):
|
||||
print(
|
||||
f"Note: CIK {cik} appears to be an INDIVIDUAL filer "
|
||||
f"(insider Form 3/4/5 only), not a corporate registrant. "
|
||||
f"The resolver may have matched an officer/director named "
|
||||
f"{company!r} rather than a company.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--cik", help="Central Index Key (will be 10-digit zero-padded)")
|
||||
p.add_argument("--company", help="Resolve to CIK by company name")
|
||||
p.add_argument("--types", default="", help="Comma-separated form types (e.g. 10-K,10-Q,8-K)")
|
||||
p.add_argument("--since", help="Skip filings before YYYY-MM-DD")
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
types = [t for t in (a.types or "").split(",") if t.strip()]
|
||||
n = fetch(cik=a.cik, company=a.company, types=types, since=a.since, out_path=a.out)
|
||||
print(f"Wrote {n} EDGAR filing rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
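The script above zero-pads the CIK for the submissions endpoint but strips the padding and the accession dashes when it builds an archive link. A small sketch of both URL shapes follows; `submissions_url` and `filing_url` are hypothetical helper names, and the CIK, accession number, and document name in the usage lines are placeholders, not a real filing.

```python
def submissions_url(cik: str) -> str:
    """Submissions JSON uses the CIK zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{cik.zfill(10)}.json"


def filing_url(cik: str, accession: str, primary_document: str) -> str:
    """Archive links use the unpadded CIK and the accession number without dashes."""
    if not (accession and primary_document):
        return ""  # mirror the script: no link when either piece is missing
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{accession.replace('-', '')}/{primary_document}"
    )


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    print(submissions_url("1234"))
    print(filing_url("0000001234", "0000001234-24-000001", "example.htm"))
```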
|
||||
|
|
@@ -0,0 +1,146 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Fetch Senate Lobbying Disclosure (LD-1 / LD-2) filings.
|
||||
|
||||
Anonymous: 120 req/hour. Token (SENATE_LDA_TOKEN): 1200 req/hour.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import os
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
ENDPOINT = "https://lda.senate.gov/api/v1/filings/"
|
||||
COLUMNS = [
|
||||
"filing_uuid",
|
||||
"filing_type",
|
||||
"filing_year",
|
||||
"filing_period",
|
||||
"registrant_name",
|
||||
"registrant_id",
|
||||
"client_name",
|
||||
"client_id",
|
||||
"client_general_description",
|
||||
"income",
|
||||
"expenses",
|
||||
"lobbyists",
|
||||
"issues",
|
||||
"government_entities",
|
||||
"filing_date",
|
||||
]
|
||||
|
||||
|
||||
def fetch(
|
||||
client: str | None,
|
||||
registrant: str | None,
|
||||
year: int,
|
||||
token: str | None,
|
||||
out_path: str,
|
||||
page_size: int = 100,
|
||||
max_pages: int = 25,
|
||||
) -> int:
|
||||
params: dict = {"filing_year": year, "page_size": page_size}
|
||||
if client:
|
||||
params["client_name"] = client
|
||||
if registrant:
|
||||
params["registrant_name"] = registrant
|
||||
|
||||
headers = {"Authorization": f"Token {token}"} if token else None
|
||||
rows: list[dict[str, str]] = []
|
||||
url = ENDPOINT
|
||||
page = 0
|
||||
while page < max_pages:
|
||||
try:
|
||||
payload = get_json(url, params=params if page == 0 else None, headers=headers)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"Senate LDA error on page {page + 1}: {e}", file=sys.stderr)
|
||||
break
|
||||
if not isinstance(payload, dict):
|
||||
break
|
||||
results = payload.get("results", [])
|
||||
for r in results:
|
||||
client_obj = r.get("client") or {}
|
||||
registrant_obj = r.get("registrant") or {}
|
||||
lobbying_activities = r.get("lobbying_activities") or []
|
||||
lobbyists = []
|
||||
issues = []
|
||||
entities = []
|
||||
for la in lobbying_activities:
|
||||
for lob in la.get("lobbyists") or []:
|
||||
lob_obj = lob.get("lobbyist") or {}
|
||||
name = " ".join(
|
||||
x for x in (lob_obj.get("first_name", ""), lob_obj.get("last_name", "")) if x
|
||||
)
|
||||
if name:
|
||||
lobbyists.append(name)
|
||||
desc = la.get("description") or ""
|
||||
if desc:
|
||||
issues.append(desc)
|
||||
for ge in la.get("government_entities") or []:
|
||||
nm = ge.get("name") or ""
|
||||
if nm:
|
||||
entities.append(nm)
|
||||
rows.append(
|
||||
{
|
||||
"filing_uuid": r.get("filing_uuid", "") or "",
|
||||
"filing_type": r.get("filing_type", "") or "",
|
||||
"filing_year": str(r.get("filing_year", "") or year),
|
||||
"filing_period": r.get("filing_period", "") or "",
|
||||
"registrant_name": registrant_obj.get("name", "") or "",
|
||||
"registrant_id": str(registrant_obj.get("id", "") or ""),
|
||||
"client_name": client_obj.get("name", "") or "",
|
||||
"client_id": str(client_obj.get("id", "") or ""),
|
||||
"client_general_description": client_obj.get("general_description", "") or "",
|
||||
"income": str(r.get("income", "") or ""),
|
||||
"expenses": str(r.get("expenses", "") or ""),
|
||||
"lobbyists": "; ".join(sorted(set(lobbyists))),
|
||||
"issues": "; ".join(issues),
|
||||
"government_entities": "; ".join(sorted(set(entities))),
|
||||
"filing_date": (r.get("dt_posted") or "")[:10],
|
||||
}
|
||||
)
|
||||
next_url = payload.get("next")
|
||||
if not next_url:
|
||||
break
|
||||
url = next_url
|
||||
page += 1
|
||||
time.sleep(1.0 if not token else 0.3)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--client", help="Client name filter")
|
||||
p.add_argument("--registrant", help="Registrant (lobbying firm) name filter")
|
||||
p.add_argument("--year", type=int, default=2024)
|
||||
p.add_argument("--token", default=os.environ.get("SENATE_LDA_TOKEN"))
|
||||
p.add_argument("--max-pages", type=int, default=25)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
if not (a.client or a.registrant):
|
||||
p.error("must supply at least one of --client / --registrant")
|
||||
n = fetch(
|
||||
client=a.client,
|
||||
registrant=a.registrant,
|
||||
year=a.year,
|
||||
token=a.token,
|
||||
out_path=a.out,
|
||||
max_pages=a.max_pages,
|
||||
)
|
||||
print(f"Wrote {n} Senate LDA rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
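The Senate LDA fetch loop follows the `next` link in each response until it is empty or a page cap is hit. That pattern can be sketched in isolation as below, assuming a caller-supplied `fetch_page` callable in place of the script's `get_json`; `iter_results` and the `PAGES` fixture are hypothetical and carry no real filing data.

```python
from __future__ import annotations

from typing import Callable, Iterator


def iter_results(
    fetch_page: Callable[[str], object],
    first_url: str,
    max_pages: int = 25,
) -> Iterator[dict]:
    """Yield items from a paginated API that links pages through a `next` URL."""
    url: str | None = first_url
    for _ in range(max_pages):
        payload = fetch_page(url)
        if not isinstance(payload, dict):
            break  # unexpected response shape: stop rather than guess
        yield from payload.get("results", [])
        url = payload.get("next")
        if not url:
            break  # last page reached


# Made-up two-page fixture keyed by URL, standing in for HTTP responses.
PAGES = {
    "page1": {"results": [{"filing_uuid": "a"}, {"filing_uuid": "b"}], "next": "page2"},
    "page2": {"results": [{"filing_uuid": "c"}], "next": None},
}

if __name__ == "__main__":
    for item in iter_results(PAGES.get, "page1"):
        print(item["filing_uuid"])
```

Injecting the fetcher keeps the loop testable offline; a real caller would also sleep between pages to stay within the hourly request limits quoted in the script's docstring.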
|
||||
|
|
@@ -0,0 +1,170 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Fetch federal contracts/awards from USAspending.gov API v2.
|
||||
|
||||
No auth required. POST to /api/v2/search/spending_by_award/ with filters.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import sys
|
||||
import time
|
||||
import urllib.request
|
||||
from pathlib import Path
|
||||
|
||||
ENDPOINT = "https://api.usaspending.gov/api/v2/search/spending_by_award/"
|
||||
COLUMNS = [
|
||||
"award_id",
|
||||
"recipient_name",
|
||||
"recipient_uei",
|
||||
"recipient_duns",
|
||||
"recipient_parent_name",
|
||||
"recipient_state",
|
||||
"awarding_agency",
|
||||
"awarding_sub_agency",
|
||||
"award_type",
|
||||
"award_amount",
|
||||
"award_date",
|
||||
"period_of_performance_start",
|
||||
"period_of_performance_end",
|
||||
"naics_code",
|
||||
"psc_code",
|
||||
"competition_extent",
|
||||
"description",
|
||||
]
|
||||
|
||||
# Field labels requested from the USAspending API; fetch() maps each one to an output column.
|
||||
_FIELDS = [
|
||||
"Award ID",
|
||||
"Recipient Name",
|
||||
"Recipient UEI",
|
||||
"Recipient DUNS Number",
|
||||
"Recipient Parent Name",
|
||||
"Recipient State Code",
|
||||
"Awarding Agency",
|
||||
"Awarding Sub Agency",
|
||||
"Award Type",
|
||||
"Award Amount",
|
||||
"Start Date",
|
||||
"End Date",
|
||||
"NAICS Code",
|
||||
"PSC Code",
|
||||
"Type of Set Aside",
|
||||
"Description",
|
||||
]
|
||||
|
||||
|
||||
def _post(body: dict) -> dict:
|
||||
req = urllib.request.Request(
|
||||
ENDPOINT,
|
||||
data=json.dumps(body).encode("utf-8"),
|
||||
headers={"Content-Type": "application/json", "User-Agent": "hermes-agent osint-investigation"},
|
||||
method="POST",
|
||||
)
|
||||
with urllib.request.urlopen(req, timeout=60) as resp:
|
||||
return json.loads(resp.read().decode("utf-8"))
|
||||
|
||||
|
||||
def fetch(
|
||||
recipient: str | None,
|
||||
agency: str | None,
|
||||
fy: int,
|
||||
sole_source_only: bool,
|
||||
out_path: str,
|
||||
page_size: int = 100,
|
||||
max_pages: int = 20,
|
||||
) -> int:
|
||||
filters: dict = {
|
||||
"time_period": [{"start_date": f"{fy - 1}-10-01", "end_date": f"{fy}-09-30"}],
|
||||
# Contracts only by default; adjust award_type_codes for grants/loans.
|
||||
"award_type_codes": ["A", "B", "C", "D"],
|
||||
}
|
||||
if recipient:
|
||||
filters["recipient_search_text"] = [recipient]
|
||||
if agency:
|
||||
filters["agencies"] = [{"type": "awarding", "tier": "toptier", "name": agency}]
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
page = 1
|
||||
while page <= max_pages:
|
||||
body = {
|
||||
"filters": filters,
|
||||
"fields": _FIELDS,
|
||||
"page": page,
|
||||
"limit": page_size,
|
||||
"sort": "Award Amount",
|
||||
"order": "desc",
|
||||
}
|
||||
try:
|
||||
payload = _post(body)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"USAspending error on page {page}: {e}", file=sys.stderr)
|
||||
break
|
||||
results = payload.get("results", [])
|
||||
if not results:
|
||||
break
|
||||
for r in results:
|
||||
set_aside = r.get("Type of Set Aside", "") or ""
|
||||
if sole_source_only and "sole" not in set_aside.lower():
|
||||
continue
|
||||
rows.append(
|
||||
{
|
||||
"award_id": r.get("Award ID", "") or "",
|
||||
"recipient_name": r.get("Recipient Name", "") or "",
|
||||
"recipient_uei": r.get("Recipient UEI", "") or "",
|
||||
"recipient_duns": r.get("Recipient DUNS Number", "") or "",
|
||||
"recipient_parent_name": r.get("Recipient Parent Name", "") or "",
|
||||
"recipient_state": r.get("Recipient State Code", "") or "",
|
||||
"awarding_agency": r.get("Awarding Agency", "") or "",
|
||||
"awarding_sub_agency": r.get("Awarding Sub Agency", "") or "",
|
||||
"award_type": r.get("Award Type", "") or "",
|
||||
"award_amount": str(r.get("Award Amount", "") or ""),
|
||||
"award_date": r.get("Start Date", "") or "",
|
||||
"period_of_performance_start": r.get("Start Date", "") or "",
|
||||
"period_of_performance_end": r.get("End Date", "") or "",
|
||||
"naics_code": str(r.get("NAICS Code", "") or ""),
|
||||
"psc_code": str(r.get("PSC Code", "") or ""),
|
||||
"competition_extent": set_aside,
|
||||
"description": r.get("Description", "") or "",
|
||||
}
|
||||
)
|
||||
meta = payload.get("page_metadata", {})
|
||||
if not meta.get("hasNext"):
|
||||
break
|
||||
page += 1
|
||||
time.sleep(0.5)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__)
|
||||
p.add_argument("--recipient", help="Recipient name search")
|
||||
p.add_argument("--agency", help="Awarding agency (top-tier)")
|
||||
p.add_argument("--fy", type=int, default=2024, help="Federal fiscal year")
|
||||
p.add_argument("--sole-source-only", action="store_true")
|
||||
p.add_argument("--max-pages", type=int, default=20)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
if not (a.recipient or a.agency):
|
||||
p.error("must supply at least one of --recipient / --agency")
|
||||
n = fetch(
|
||||
recipient=a.recipient,
|
||||
agency=a.agency,
|
||||
fy=a.fy,
|
||||
sole_source_only=a.sole_source_only,
|
||||
out_path=a.out,
|
||||
max_pages=a.max_pages,
|
||||
)
|
||||
print(f"Wrote {n} USAspending rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
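The USAspending script turns a federal fiscal year into an October-to-September date window and posts it as part of a JSON filter body. The sketch below shows that construction on its own; `fiscal_year_window` and `build_body` are hypothetical helper names, the field list is trimmed to three labels taken from the script, and nothing here contacts the API.

```python
from __future__ import annotations

import json


def fiscal_year_window(fy: int) -> tuple[str, str]:
    """Federal FY N runs from 1 October of year N-1 to 30 September of year N."""
    return f"{fy - 1}-10-01", f"{fy}-09-30"


def build_body(recipient: str | None, fy: int, page: int = 1, limit: int = 100) -> dict:
    """Assemble a spending_by_award request body limited to contract award types."""
    start, end = fiscal_year_window(fy)
    filters: dict = {
        "time_period": [{"start_date": start, "end_date": end}],
        "award_type_codes": ["A", "B", "C", "D"],  # contract types, as in the script
    }
    if recipient:
        filters["recipient_search_text"] = [recipient]
    return {
        "filters": filters,
        "fields": ["Award ID", "Recipient Name", "Award Amount"],
        "page": page,
        "limit": limit,
        "sort": "Award Amount",
        "order": "desc",
    }


if __name__ == "__main__":
    # "Example Corp" is a placeholder recipient name.
    print(json.dumps(build_body("Example Corp", 2024), indent=2))
```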
|
||||
|
|
@@ -0,0 +1,142 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Search the Internet Archive Wayback Machine via the CDX server.
|
||||
|
||||
The CDX API indexes ~900B+ archived web pages. Anonymous read access,
|
||||
no auth required. Useful for finding deleted / changed pages by URL,
|
||||
domain, or substring match.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import sys
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
BASE = "https://web.archive.org/cdx/search/cdx"
|
||||
|
||||
COLUMNS = [
|
||||
"url",
|
||||
"timestamp",
|
||||
"wayback_url",
|
||||
"mimetype",
|
||||
"status",
|
||||
"digest",
|
||||
"length",
|
||||
]
|
||||
|
||||
|
||||
def fetch(
|
||||
url_or_host: str,
|
||||
match_type: str,
|
||||
from_date: str | None,
|
||||
to_date: str | None,
|
||||
status: str | None,
|
||||
mime: str | None,
|
||||
collapse: str | None,
|
||||
limit: int,
|
||||
out_path: str,
|
||||
) -> int:
|
||||
params: dict[str, str] = {
|
||||
"url": url_or_host,
|
||||
"matchType": match_type,
|
||||
"output": "json",
|
||||
"limit": str(limit),
|
||||
}
|
||||
if from_date:
|
||||
params["from"] = from_date.replace("-", "")
|
||||
if to_date:
|
||||
params["to"] = to_date.replace("-", "")
|
||||
    # The CDX server accepts repeated filter= params, so collect each filter
    # separately; storing them under one dict key would let --mime silently
    # overwrite --status.
    filters: list[str] = []
    if status:
        filters.append(f"statuscode:{status}")
    if mime:
        filters.append(f"mimetype:{mime}")
    if collapse:
        params["collapse"] = collapse

    query = list(params.items()) + [("filter", f) for f in filters]
    url = f"{BASE}?{urllib.parse.urlencode(query)}"
|
||||
try:
|
||||
payload = get_json(url)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"Wayback CDX error: {e}", file=sys.stderr)
|
||||
payload = []
|
||||
|
||||
rows: list[dict[str, str]] = []
|
||||
if isinstance(payload, list) and len(payload) > 1:
|
||||
header = payload[0]
|
||||
idx = {h: i for i, h in enumerate(header)}
|
||||
for entry in payload[1:]:
|
||||
ts = entry[idx["timestamp"]] if "timestamp" in idx else ""
|
||||
orig = entry[idx["original"]] if "original" in idx else ""
|
||||
rows.append(
|
||||
{
|
||||
"url": orig,
|
||||
"timestamp": ts,
|
||||
"wayback_url": f"https://web.archive.org/web/{ts}/{orig}" if ts and orig else "",
|
||||
"mimetype": entry[idx["mimetype"]] if "mimetype" in idx else "",
|
||||
"status": entry[idx["statuscode"]] if "statuscode" in idx else "",
|
||||
"digest": entry[idx["digest"]] if "digest" in idx else "",
|
||||
"length": entry[idx["length"]] if "length" in idx else "",
|
||||
}
|
||||
)
|
||||
|
||||
Path(out_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
with open(out_path, "w", newline="", encoding="utf-8") as fh:
|
||||
w = csv.DictWriter(fh, fieldnames=COLUMNS)
|
||||
w.writeheader()
|
||||
w.writerows(rows)
|
||||
if not rows:
|
||||
print(
|
||||
f"Wayback Machine: 0 captures for {url_or_host!r} matchType={match_type}.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
return len(rows)
|
||||
|
||||
|
||||
def main() -> int:
|
||||
p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
|
||||
p.add_argument("--url", required=True, help="URL or host to look up in the archive")
|
||||
p.add_argument(
|
||||
"--match",
|
||||
default="exact",
|
||||
choices=["exact", "prefix", "host", "domain"],
|
||||
help=(
|
||||
"exact: this URL only. "
|
||||
"prefix: this URL's path-prefix. "
|
||||
"host: any URL on this host. "
|
||||
"domain: any URL on this domain or subdomains."
|
||||
),
|
||||
)
|
||||
p.add_argument("--from-date", help="Earliest capture YYYY-MM-DD")
|
||||
p.add_argument("--to-date", help="Latest capture YYYY-MM-DD")
|
||||
p.add_argument("--status", help="HTTP status filter (e.g. 200)")
|
||||
p.add_argument("--mime", help="MIME type filter (e.g. text/html)")
|
||||
p.add_argument(
|
||||
"--collapse",
|
||||
help="Collapse adjacent identical entries (e.g. 'digest' for unique-content captures)",
|
||||
)
|
||||
p.add_argument("--limit", type=int, default=200)
|
||||
p.add_argument("--out", required=True)
|
||||
a = p.parse_args()
|
||||
n = fetch(
|
||||
url_or_host=a.url,
|
||||
match_type=a.match,
|
||||
from_date=a.from_date,
|
||||
to_date=a.to_date,
|
||||
status=a.status,
|
||||
mime=a.mime,
|
||||
collapse=a.collapse,
|
||||
limit=a.limit,
|
||||
out_path=a.out,
|
||||
)
|
||||
print(f"Wrote {n} Wayback capture rows to {a.out}")
|
||||
return 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
raise SystemExit(main())
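Two details of the Wayback script are easy to get wrong: sending more than one `filter` parameter, and reading the CDX JSON output, whose first row is a header. The sketch below illustrates both under stated assumptions: `cdx_url` and `rows_from_payload` are hypothetical helpers, and `SAMPLE_PAYLOAD` is a made-up response in the header-plus-rows shape the script expects.

```python
from __future__ import annotations

import urllib.parse

BASE = "https://web.archive.org/cdx/search/cdx"


def cdx_url(url: str, match_type: str = "exact", status: str | None = None,
            mime: str | None = None, limit: int = 200) -> str:
    """Build a CDX query, repeating the filter key once per active filter."""
    query = [("url", url), ("matchType", match_type), ("output", "json"), ("limit", str(limit))]
    if status:
        query.append(("filter", f"statuscode:{status}"))
    if mime:
        query.append(("filter", f"mimetype:{mime}"))
    # A list of pairs lets urlencode emit the same key more than once.
    return f"{BASE}?{urllib.parse.urlencode(query)}"


def rows_from_payload(payload: object) -> list[dict[str, str]]:
    """Turn the header-plus-rows JSON into dicts with a replay link."""
    if not isinstance(payload, list) or len(payload) < 2:
        return []
    header, *entries = payload
    idx = {name: i for i, name in enumerate(header)}
    rows = []
    for entry in entries:
        ts, orig = entry[idx["timestamp"]], entry[idx["original"]]
        rows.append({"url": orig, "timestamp": ts,
                     "wayback_url": f"https://web.archive.org/web/{ts}/{orig}"})
    return rows


# Made-up payload for illustration; not a real capture record.
SAMPLE_PAYLOAD = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20200101000000", "http://example.com/", "text/html", "200", "ABC", "1234"],
]

if __name__ == "__main__":
    print(cdx_url("example.com", match_type="domain", status="200", mime="text/html"))
    print(rows_from_payload(SAMPLE_PAYLOAD))
```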
|
||||
|
|
@@ -0,0 +1,267 @@
|
|||
#!/usr/bin/env python3
|
||||
"""Search Wikipedia + Wikidata for an entity (person, company, place, concept).
|
||||
|
||||
Two free APIs:
|
||||
- Wikipedia OpenSearch + REST summary endpoint for narrative bio
|
||||
- Wikidata Action API (wbgetentities) for structured facts (birth, employer, occupation, etc.)
|
||||
|
||||
Both are anonymous-access. Useful for resolving who-is-this-entity questions
|
||||
and surfacing cross-references that other sources can join against.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import csv
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
import urllib.parse
|
||||
from pathlib import Path
|
||||
|
||||
sys.path.insert(0, str(Path(__file__).parent))
|
||||
from _http import get_json # noqa: E402
|
||||
|
||||
WP_OPENSEARCH = "https://en.wikipedia.org/w/api.php"
|
||||
WP_SUMMARY = "https://en.wikipedia.org/api/rest_v1/page/summary/"
|
||||
WD_ACTION = "https://www.wikidata.org/w/api.php"
|
||||
|
||||
COLUMNS = [
|
||||
"source",
|
||||
"label",
|
||||
"description",
|
||||
"qid",
|
||||
"wikipedia_title",
|
||||
"wikipedia_url",
|
||||
"wikidata_url",
|
||||
"instance_of",
|
||||
"country",
|
||||
"occupation",
|
||||
"employer",
|
||||
"date_of_birth",
|
||||
"place_of_birth",
|
||||
"summary",
|
||||
]
|
||||
|
||||
|
||||
def _wp_search(query: str, limit: int) -> list[dict]:
|
||||
params = {
|
||||
"action": "opensearch",
|
||||
"search": query,
|
||||
"limit": str(min(limit, 20)),
|
||||
"format": "json",
|
||||
}
|
||||
url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}"
|
||||
data = get_json(url)
|
||||
if not isinstance(data, list) or len(data) < 4:
|
||||
return []
|
||||
titles, descs, urls = data[1], data[2], data[3]
|
||||
out = []
|
||||
for i, title in enumerate(titles):
|
||||
out.append(
|
||||
{
|
||||
"title": title,
|
||||
"description": descs[i] if i < len(descs) else "",
|
||||
"url": urls[i] if i < len(urls) else "",
|
||||
}
|
||||
)
|
||||
return out
|
||||
|
||||
|
||||
def _wp_summary(title: str) -> dict:
|
||||
"""Pull the REST summary for a title — short bio, image, type."""
|
||||
url = f"{WP_SUMMARY}{urllib.parse.quote(title.replace(' ', '_'))}"
|
||||
try:
|
||||
return get_json(url) # type: ignore[return-value]
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"Wikipedia summary lookup for {title!r} failed: {e}", file=sys.stderr)
|
||||
return {}
|
||||
|
||||
|
||||
def _wd_lookup_by_qid(qid: str) -> dict:
|
||||
"""Pull common facts for a QID via Wikidata's Action API (no SPARQL).
|
||||
|
||||
The Action API is far more lenient on rate-limits than the SPARQL Query
|
||||
Service. We get claims as QIDs and then resolve labels in one batch call.
|
||||
"""
|
||||
# Properties of interest. The Action API returns claims as QIDs or
|
||||
# typed literals, so the slot mapping is local-only.
|
||||
interesting = {
|
||||
"P31": "instance_of",
|
||||
"P17": "country", # for orgs / places
|
||||
"P27": "country", # for individuals (country of citizenship)
|
||||
"P106": "occupation",
|
||||
"P108": "employer",
|
||||
"P569": "date_of_birth",
|
||||
"P19": "place_of_birth",
|
||||
}
|
||||
params = {
|
||||
"action": "wbgetentities",
|
||||
"ids": qid,
|
||||
"props": "claims",
|
||||
"format": "json",
|
||||
}
|
||||
url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}"
|
||||
try:
|
||||
data = get_json(url)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"Wikidata wbgetentities for {qid} failed: {e}", file=sys.stderr)
|
||||
return {}
|
||||
if not isinstance(data, dict):
|
||||
return {}
|
||||
claims = (data.get("entities", {}).get(qid, {}) or {}).get("claims", {}) or {}
|
||||
|
||||
# Collect raw values (QIDs or literals) and remember which slot each
|
||||
# came from. Date literals come back as ISO strings; QIDs need a label
|
||||
# resolution pass.
|
||||
qid_to_slots: dict[str, list[str]] = {}
|
||||
facts: dict[str, list[str]] = {}
|
||||
for prop_id, slot in interesting.items():
|
||||
for claim in claims.get(prop_id, []) or []:
|
||||
v = (claim.get("mainsnak", {}) or {}).get("datavalue", {}) or {}
|
||||
vtype = v.get("type")
|
||||
value = v.get("value")
|
||||
if vtype == "wikibase-entityid" and isinstance(value, dict):
|
||||
vqid = value.get("id", "")
|
||||
if vqid:
|
||||
qid_to_slots.setdefault(vqid, [])
|
||||
if slot not in qid_to_slots[vqid]:
|
||||
qid_to_slots[vqid].append(slot)
|
||||
elif vtype == "time" and isinstance(value, dict):
|
||||
raw = value.get("time", "") or ""
|
||||
# +1955-10-28T00:00:00Z → 1955-10-28
|
||||
m = re.search(r"[+-]?(\d{4})-(\d{2})-(\d{2})", raw)
|
||||
if m:
|
||||
facts.setdefault(slot, []).append(
|
||||
f"{m.group(1)}-{m.group(2)}-{m.group(3)}"
|
||||
)
|
||||
elif vtype == "string":
|
||||
facts.setdefault(slot, []).append(str(value))
|
||||
|
||||
# Resolve labels for all referenced QIDs in one batch (up to 50 at a time).
|
||||
qids = list(qid_to_slots)
|
||||
for i in range(0, len(qids), 50):
|
||||
batch = qids[i : i + 50]
|
||||
params = {
|
||||
"action": "wbgetentities",
|
||||
"ids": "|".join(batch),
|
||||
"props": "labels",
|
||||
"languages": "en",
|
||||
"format": "json",
|
||||
}
|
||||
url = f"{WD_ACTION}?{urllib.parse.urlencode(params)}"
|
||||
try:
|
||||
data = get_json(url)
|
||||
except Exception as e: # noqa: BLE001
|
||||
print(f"Wikidata label batch failed: {e}", file=sys.stderr)
|
||||
continue
|
||||
if not isinstance(data, dict):
|
||||
continue
|
||||
ents = data.get("entities", {}) or {}
|
||||
for vqid, ent in ents.items():
|
||||
label = (ent.get("labels", {}).get("en", {}) or {}).get("value", "") or vqid
|
||||
for slot in qid_to_slots.get(vqid, []):
|
||||
facts.setdefault(slot, []).append(label)
|
||||
|
||||
# Deduplicate per slot, preserving order.
|
||||
deduped: dict[str, list[str]] = {}
|
||||
for slot, vals in facts.items():
|
||||
seen = set()
|
||||
out = []
|
||||
for v in vals:
|
||||
if v in seen:
|
||||
continue
|
||||
seen.add(v)
|
||||
out.append(v)
|
||||
deduped[slot] = out
|
||||
return deduped
|
||||
|
||||
|
||||
def _wd_qid_for_title(title: str) -> str:
|
||||
"""Get the Wikidata QID associated with a Wikipedia article title."""
|
||||
params = {
|
||||
"action": "query",
|
||||
"format": "json",
|
||||
"prop": "pageprops",
|
||||
"ppprop": "wikibase_item",
|
||||
"titles": title,
|
||||
"redirects": 1,
|
||||
}
|
||||
url = f"{WP_OPENSEARCH}?{urllib.parse.urlencode(params)}"
|
||||
try:
|
||||
data = get_json(url)
|
||||
except Exception: # noqa: BLE001
|
||||
return ""
|
||||
if not isinstance(data, dict):
|
||||
return ""
|
||||
pages = data.get("query", {}).get("pages", {}) or {}
|
||||
for page in pages.values():
|
||||
qid = (page.get("pageprops") or {}).get("wikibase_item", "")
|
||||
if qid:
|
||||
return qid
|
||||
return ""
|
||||
|
||||
|
||||
def fetch(query: str, limit: int, no_wikidata: bool, out_path: str) -> int:
    hits = _wp_search(query, limit)
    rows: list[dict[str, str]] = []
    for hit in hits[:limit]:
        title = hit.get("title", "")
        if not title:
            continue
        summary = _wp_summary(title)
        qid = _wd_qid_for_title(title) if not no_wikidata else ""
        facts: dict = {}
        if qid:
            facts = _wd_lookup_by_qid(qid)
        rows.append(
            {
                "source": "wikipedia+wikidata" if qid else "wikipedia",
                "label": title,
                "description": (summary.get("description") or hit.get("description") or "").strip(),
                "qid": qid,
                "wikipedia_title": title,
                "wikipedia_url": hit.get("url", ""),
                "wikidata_url": f"https://www.wikidata.org/wiki/{qid}" if qid else "",
                "instance_of": "; ".join(facts.get("instance_of", [])),
                "country": "; ".join(facts.get("country", [])),
                "occupation": "; ".join(facts.get("occupation", [])),
                "employer": "; ".join(facts.get("employer", [])),
                "date_of_birth": "; ".join(facts.get("date_of_birth", []))[:10] if facts.get("date_of_birth") else "",
                "place_of_birth": "; ".join(facts.get("place_of_birth", [])),
                "summary": (summary.get("extract") or "").replace("\n", " ")[:1000],
            }
        )

    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        w = csv.DictWriter(fh, fieldnames=COLUMNS)
        w.writeheader()
        w.writerows(rows)
    if not rows:
        print(
            f"Wikipedia: 0 articles for query={query!r}. "
            "Private individuals not notable enough for a Wikipedia article "
            "won't appear here (the bar is real).",
            file=sys.stderr,
        )
    return len(rows)


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--query", required=True, help="Entity name (person, company, place, concept)")
    p.add_argument("--limit", type=int, default=5)
    p.add_argument(
        "--no-wikidata",
        action="store_true",
        help="Skip the Wikidata SPARQL enrichment (faster, less detail)",
    )
    p.add_argument("--out", required=True)
    a = p.parse_args()
    n = fetch(query=a.query, limit=a.limit, no_wikidata=a.no_wikidata, out_path=a.out)
    print(f"Wrote {n} Wikipedia/Wikidata rows to {a.out}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
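
Review note: the per-slot deduplication loop earlier in this script keeps the first occurrence of each label and preserves order. As a sketch only (this is not what the commit does), the same behavior can be written with `dict.fromkeys`, relying on dictionaries preserving insertion order in Python 3.7+. The helper name `dedupe_slots` and the sample data are hypothetical.

```python
from __future__ import annotations


def dedupe_slots(facts: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop repeated labels per slot while keeping first-seen order."""
    # dict.fromkeys keeps one key per distinct label, in insertion order.
    return {slot: list(dict.fromkeys(vals)) for slot, vals in facts.items()}


# Hypothetical sample shaped like the `facts` mapping built above.
sample = {
    "occupation": ["politician", "lawyer", "politician"],
    "country": ["United States"],
}
print(dedupe_slots(sample))
```

The explicit loop in the commit is arguably easier to step through in a debugger; the one-liner trades that for brevity.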
@@ -0,0 +1,253 @@
#!/usr/bin/env python3
"""Permutation test for donation/contract timing correlation (stdlib-only).

For each (donor, vendor) pair, compute the mean number of days between each
donation and the nearest contract award. Then, N times, redraw the same number
of award dates uniformly at random within the observation window (earliest to
latest award date) and compute the same statistic. The one-tailed p-value is
the fraction of draws whose mean is <= the observed mean (smaller distance =
tighter clustering).

Adapted from ShinMegamiBoson/OpenPlanter (MIT). Differences:
- Pure stdlib (no pandas / numpy)
- Domain-agnostic (no snow-vendor / CRITICAL-politician filter)
- Configurable column names via flags
- Optional --seed for reproducibility
"""
from __future__ import annotations

import argparse
import csv
import datetime as dt
import json
import math
import random
import statistics
from collections import defaultdict
from pathlib import Path

_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d", "%m-%d-%Y", "%Y%m%d")


def parse_date(raw: str) -> dt.date | None:
    if not raw:
        return None
    raw = raw.strip()
    for fmt in _DATE_FORMATS:
        try:
            return dt.datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


def _read(path: str) -> list[dict[str, str]]:
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def _nearest_distance(donation_date: dt.date, awards: list[dt.date]) -> int:
    """Absolute days to nearest award date."""
    return min(abs((donation_date - a).days) for a in awards)


def _permute(
    awards_count: int,
    donations: list[dt.date],
    date_min: dt.date,
    date_max: dt.date,
    rng: random.Random,
) -> float:
    """One permutation: draw uniform random award dates, compute mean nearest-distance."""
    span_days = (date_max - date_min).days or 1
    rand_awards = [
        date_min + dt.timedelta(days=rng.randint(0, span_days))
        for _ in range(awards_count)
    ]
    distances = [_nearest_distance(d, rand_awards) for d in donations]
    return statistics.mean(distances)


def analyze(
    donations_path: str,
    donation_date_col: str,
    donation_amount_col: str,
    donation_donor_col: str,
    donation_recipient_col: str,
    contracts_path: str,
    contract_date_col: str,
    contract_vendor_col: str,
    cross_links_path: str | None,
    n_permutations: int = 1000,
    min_donations: int = 3,
    p_threshold: float = 0.05,
    seed: int | None = None,
    out_path: str = "timing.json",
) -> dict:
    rng = random.Random(seed)

    donations = _read(donations_path)
    contracts = _read(contracts_path)

    # Allow optional join through cross_links: donor (left) ↔ vendor (right).
    # When present, donor strings get mapped to matched vendor names so the
    # vendor-date index lookup actually finds the contracts.
    matched_pairs: set[tuple[str, str]] | None = None
    donor_to_vendors: dict[str, set[str]] = defaultdict(set)
    if cross_links_path:
        matched_pairs = set()
        for row in _read(cross_links_path):
            left = row.get("left_name", "")
            right = row.get("right_name", "")
            matched_pairs.add((left, right))
            donor_to_vendors[left].add(right)

    # Index contract dates by vendor name.
    vendor_to_award_dates: dict[str, list[dt.date]] = defaultdict(list)
    all_award_dates: list[dt.date] = []
    for row in contracts:
        d = parse_date(row.get(contract_date_col, ""))
        if not d:
            continue
        vendor_to_award_dates[row.get(contract_vendor_col, "").strip()].append(d)
        all_award_dates.append(d)

    if not all_award_dates:
        raise SystemExit(f"No parseable dates in {contracts_path}/{contract_date_col}")
    global_min = min(all_award_dates)
    global_max = max(all_award_dates)

    # Group donations by (donor, recipient).
    grouped: dict[tuple[str, str], list[tuple[dt.date, float]]] = defaultdict(list)
    for row in donations:
        donor = row.get(donation_donor_col, "").strip()
        recip = row.get(donation_recipient_col, "").strip()
        d = parse_date(row.get(donation_date_col, ""))
        try:
            amt = float(row.get(donation_amount_col, "0") or 0)
        except ValueError:
            amt = 0.0
        if not (donor and recip and d):
            continue
        grouped[(donor, recip)].append((d, amt))

    results = []
    skipped = 0
    for (donor, recip), records in grouped.items():
        if len(records) < min_donations:
            skipped += 1
            continue
        # Only test if donor appears in cross-links (when provided). The
        # (donor, candidate) tuple itself is NOT what's in matched_pairs;
        # cross_links pairs are (donor, vendor). We use the cross-link to
        # map donor → vendor name(s) so the vendor-date index resolves.
        if matched_pairs is not None and donor not in donor_to_vendors:
            skipped += 1
            continue
        # Try direct donor→awards first, then go through cross-link vendor names.
        award_dates = list(vendor_to_award_dates.get(donor, []))
        if not award_dates:
            award_dates = list(vendor_to_award_dates.get(recip, []))
        if not award_dates and donor_to_vendors.get(donor):
            for vendor_name in donor_to_vendors[donor]:
                award_dates.extend(vendor_to_award_dates.get(vendor_name, []))
        if not award_dates:
            skipped += 1
            continue

        donation_dates = [d for (d, _) in records]
        observed = statistics.mean(
            _nearest_distance(d, award_dates) for d in donation_dates
        )

        permuted_means = [
            _permute(len(award_dates), donation_dates, global_min, global_max, rng)
            for _ in range(n_permutations)
        ]
        p_value = sum(1 for m in permuted_means if m <= observed) / n_permutations
        null_mean = statistics.mean(permuted_means)
        null_std = statistics.pstdev(permuted_means) or 1.0
        effect_size = (null_mean - observed) / null_std

        results.append(
            {
                "donor": donor,
                "recipient": recip,
                "n_donations": len(records),
                "n_award_dates": len(award_dates),
                "observed_mean_days": round(observed, 2),
                "null_mean_days": round(null_mean, 2),
                "p_value": round(p_value, 4),
                "effect_size_sd": round(effect_size, 2),
                "significant": p_value < p_threshold,
                "total_donation_amount": round(sum(a for (_, a) in records), 2),
            }
        )

    results.sort(key=lambda r: r["p_value"])

    payload = {
        "metadata": {
            "n_permutations": n_permutations,
            "min_donations": min_donations,
            "p_threshold": p_threshold,
            "seed": seed,
            "n_pairs_tested": len(results),
            "n_pairs_skipped": skipped,
            "n_significant": sum(1 for r in results if r["significant"]),
            "observation_window": [global_min.isoformat(), global_max.isoformat()],
        },
        "results": results,
    }

    Path(out_path).write_text(json.dumps(payload, indent=2))
    return payload


def main() -> int:
    p = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    p.add_argument("--donations", required=True)
    p.add_argument("--donation-date-col", required=True)
    p.add_argument("--donation-amount-col", required=True)
    p.add_argument("--donation-donor-col", required=True)
    p.add_argument("--donation-recipient-col", required=True)
    p.add_argument("--contracts", required=True)
    p.add_argument("--contract-date-col", required=True)
    p.add_argument("--contract-vendor-col", required=True)
    p.add_argument(
        "--cross-links",
        help="Optional cross_links.csv to restrict (donor, vendor) pairs",
    )
    p.add_argument("--permutations", type=int, default=1000)
    p.add_argument("--min-donations", type=int, default=3)
    p.add_argument("--p-threshold", type=float, default=0.05)
    p.add_argument("--seed", type=int)
    p.add_argument("--out", default="timing.json")
    a = p.parse_args()

    payload = analyze(
        donations_path=a.donations,
        donation_date_col=a.donation_date_col,
        donation_amount_col=a.donation_amount_col,
        donation_donor_col=a.donation_donor_col,
        donation_recipient_col=a.donation_recipient_col,
        contracts_path=a.contracts,
        contract_date_col=a.contract_date_col,
        contract_vendor_col=a.contract_vendor_col,
        cross_links_path=a.cross_links,
        n_permutations=a.permutations,
        min_donations=a.min_donations,
        p_threshold=a.p_threshold,
        seed=a.seed,
        out_path=a.out,
    )
    meta = payload["metadata"]
    print(
        f"Tested {meta['n_pairs_tested']} pairs ({meta['n_pairs_skipped']} skipped). "
        f"Significant (p<{meta['p_threshold']}): {meta['n_significant']}. "
        f"Wrote {a.out}"
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
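
Review note: the statistic computed in `analyze()` can be reproduced in isolation on a handful of dates, which helps when sanity-checking a result. The sketch below mirrors the nearest-award distance and the uniform redraw of award dates, and also reports the add-one estimate `(hits + 1) / (n + 1)` next to the raw fraction. The add-one estimate is not part of this commit; it is a common alternative that never returns exactly 0 when no random draw is as tight as the observed data. The names `nearest_days` and `timing_p_values` and all dates are hypothetical.

```python
from __future__ import annotations

import datetime as dt
import random
import statistics


def nearest_days(day: dt.date, awards: list[dt.date]) -> int:
    """Absolute days from `day` to the closest award date."""
    return min(abs((day - a).days) for a in awards)


def timing_p_values(
    donations: list[dt.date],
    awards: list[dt.date],
    n_draws: int = 1000,
    seed: int | None = 0,
) -> tuple[float, float]:
    """Return (raw, add_one) one-tailed p-values for donation/award clustering."""
    rng = random.Random(seed)
    lo, hi = min(awards), max(awards)
    span = (hi - lo).days or 1  # avoid a zero-width window
    observed = statistics.mean(nearest_days(d, awards) for d in donations)
    hits = 0
    for _ in range(n_draws):
        # Redraw the same number of award dates uniformly inside the window.
        fake = [lo + dt.timedelta(days=rng.randint(0, span)) for _ in awards]
        if statistics.mean(nearest_days(d, fake) for d in donations) <= observed:
            hits += 1
    return hits / n_draws, (hits + 1) / (n_draws + 1)


# Synthetic dates: every donation falls within two days of an award.
awards = [dt.date(2023, 1, 1), dt.date(2023, 7, 1), dt.date(2023, 12, 31)]
donations = [dt.date(2023, 1, 2), dt.date(2023, 6, 30), dt.date(2023, 12, 29)]
raw_p, add_one_p = timing_p_values(donations, awards, n_draws=500, seed=42)
print(raw_p, add_one_p)
```

As in the script, the window here is bounded by the earliest and latest award date, so donations that fall outside that range can never sit close to a redrawn award; that is worth keeping in mind when reading small p-values.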