# GDELT — Global News Monitoring

## 1. Summary

GDELT (Global Database of Events, Language, and Tone) monitors world news
in 100+ languages with full-text indexing. The index is updated every 15
minutes and covers roughly 2015 to the present, with an estimated 1B+
articles indexed. Access is free and anonymous.

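GDELT covers a wider range of sources than Google News (more international,
more long-tail) and indexes articles by tone/sentiment, themes (GKG theme
codes such as `ECON_BANKRUPTCY`; CAMEO codes classify events in the separate
Events stream), people, and organizations.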

## 2. Access Methods

- **DOC 2.0 API:** `https://api.gdeltproject.org/api/v2/doc/doc`
- **Events / GKG 2.0:** `https://api.gdeltproject.org/api/v2/events/events`
- **Auth:** None
- **Rate limit:** **1 request per 5 seconds** for the DOC API — strict

The fetch script automatically retries after a 6-second sleep when a
429 is received.
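
The retry behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `fetch_gdelt.py`: the function name `fetch_with_retry`, the `max_retries` cap, and the injectable `opener` / `sleep` parameters are assumptions added here so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, opener=urllib.request.urlopen, sleep=time.sleep,
                     retry_delay=6.0, max_retries=3):
    """Return the response body, sleeping and retrying on HTTP 429.

    Hypothetical helper: `retry_delay` mirrors the 6-second sleep described
    in this document; `max_retries` is an assumed safety cap.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only rate-limit responses are retried; other errors propagate.
            if err.code != 429 or attempt == max_retries:
                raise
            sleep(retry_delay)
```

Sleeping slightly longer than the 5-second window leaves a margin for clock skew between client and server.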

## 3. Data Schema

Key fields emitted by `fetch_gdelt.py`:

| Column | Type | Description |
|--------|------|-------------|
| `title` | str | Article title |
| `url` | str | Article URL |
| `seen_date` | str | When GDELT first saw the article (UTC) |
| `domain` | str | Publisher domain |
| `language` | str | Source language |
| `source_country` | str | 2-letter country code |
| `tone` | str | GDELT-computed tone score (negative = negative coverage) |
| `social_image` | str | Open Graph image URL when available |
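
Because every column is emitted as a string, downstream code usually needs to coerce `tone` and `seen_date` before filtering or sorting. The sketch below shows one way to do that with the standard library. The helper names are hypothetical, and the compact `YYYYMMDDTHHMMSSZ` timestamp layout is an assumption about what the fetcher writes; adjust the format string if the CSV uses a different layout.

```python
import csv
import io
from datetime import datetime, timezone


def parse_row(row):
    """Coerce the string-typed columns into richer Python types."""
    out = dict(row)
    # tone is emitted as a string; empty values become None.
    tone = (row.get("tone") or "").strip()
    out["tone"] = float(tone) if tone else None
    # ASSUMPTION: seen_date uses the compact form "YYYYMMDDTHHMMSSZ" (UTC).
    seen = (row.get("seen_date") or "").strip()
    out["seen_date"] = (
        datetime.strptime(seen, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)
        if seen else None
    )
    return out


def load_rows(text):
    """Parse CSV text with a header row into a list of typed dicts."""
    return [parse_row(r) for r in csv.DictReader(io.StringIO(text))]
```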

## 4. Coverage

- Worldwide news in 100+ languages
- ~2015 → present (Events back to 1979 via a separate stream)
- Update frequency: 15 minutes
- Bias: heavily Anglophone in volume but very wide source list overall

## 5. Cross-Reference Potential

- **All sources** ↔ `title` / `url` (news context for any subject)
- **Wikipedia** ↔ event timeline for notable entities
- **Wayback Machine** ↔ recover articles whose URLs have died
- **OFAC SDN** ↔ news context for sanctions designations
- **SEC EDGAR** ↔ news context for 8-K material events

Join key: entity name appearing in article title or full-text. GDELT also
extracts named entities into a separate stream (GKG) not exposed by this
fetcher — query GDELT directly for entity-level filtering.
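
A title-based join of the kind described above might look like the sketch below. It matches an entity name as a whole phrase, ignoring case, so that `Microsoft` does not match `Microsoftware`. Both function names are hypothetical, and the approach is only a heuristic: it does not resolve aliases, abbreviations, or transliterations.

```python
import re


def title_mentions(title, entity):
    """True when `entity` appears in `title` as a whole phrase, ignoring case."""
    words = [re.escape(w) for w in entity.split()]
    if not words:
        return False
    # Allow any run of whitespace between words; forbid word characters
    # immediately before or after the phrase.
    pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None


def join_on_entity(articles, entities):
    """Yield (entity, article) pairs for each title that mentions an entity."""
    for article in articles:
        for entity in entities:
            if title_mentions(article.get("title", ""), entity):
                yield entity, article
```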

## 6. Data Quality

- Title extraction is automated and can be wrong (sometimes captures the
  site name + delimiter + article title; sometimes a generic page title)
- Sentiment / tone is computed by GDELT, not source-supplied
- Some domains are oversampled (newswires, aggregators)
- Source country is inferred from domain registration / TLD — can be
  wrong for international news sites with country-neutral domains
- Article URLs can rot — pair with Wayback Machine to preserve content
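
For the title-extraction issue in the first bullet, one possible cleanup is to drop short title segments that look like the publisher's name. The sketch below is a heuristic under stated assumptions: the function name is hypothetical, the delimiter set is a guess at common patterns, and the site name is approximated by the first label of the `domain` column. It will miss publishers whose display name differs from their domain.

```python
import re

# Delimiters commonly placed between a site name and the headline
# (pipe, hyphen, en dash, em dash), each surrounded by whitespace.
_SPLIT = re.compile(r"\s+[|\-\u2013\u2014]\s+")


def strip_site_name(title, domain):
    """Drop short title segments that look like the publisher's name."""
    host = domain.lower()
    if host.startswith("www."):
        host = host[4:]
    site = re.sub(r"[^a-z0-9]", "", host.split(".")[0])
    kept = []
    for segment in _SPLIT.split(title.strip()):
        norm = re.sub(r"[^a-z0-9]", "", segment.lower())
        # A segment of at most two words that starts with the site label
        # is treated as the site name.
        looks_like_site = (
            bool(site) and norm.startswith(site) and len(segment.split()) <= 2
        )
        if not looks_like_site:
            kept.append(segment)
    # Fall back to the original title if nothing survives.
    return " - ".join(kept) if kept else title.strip()
```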

## 7. Acquisition Script

Path: `scripts/fetch_gdelt.py`

```bash
# Recent news mentioning an entity
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Nous Research" \
    --timespan 6m --out data/gdelt.csv

# Phrase-exact (use double quotes inside single quotes for the shell)
python3 SKILL_DIR/scripts/fetch_gdelt.py --query '"Dillon Rolnick"' \
    --timespan 1y --out data/gdelt.csv

# Filter to a country / language
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --source-country US --source-lang English --out data/gdelt.csv

# Date range
python3 SKILL_DIR/scripts/fetch_gdelt.py --query "Microsoft" \
    --start 2024-01-01 --end 2024-12-31 --out data/gdelt.csv
```

GDELT has its own query operators: quoted phrases, boolean logic
(space-separated terms are ANDed, `(a OR b)` groups alternatives, and a
leading `-` negates a term), `sourcecountry:US`, `theme:ECON_BANKRUPTCY`,
`tone<-5`, etc.
See https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/ for syntax.
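
To make the operator syntax concrete, the sketch below assembles a DOC 2.0 request URL from a search term plus optional operators. The helper `build_doc_url` and its keyword arguments are hypothetical and are not claimed to match what `fetch_gdelt.py` does internally; the request parameters `query`, `mode`, `format`, `maxrecords`, and `timespan` are the ones documented for the DOC 2.0 API.

```python
from urllib.parse import urlencode

DOC_API = "https://api.gdeltproject.org/api/v2/doc/doc"


def build_doc_url(terms, source_country=None, theme=None, max_tone=None,
                  timespan=None, max_records=75):
    """Build a DOC 2.0 article-list URL; operators are appended to the query."""
    parts = [terms]
    if source_country:
        parts.append(f"sourcecountry:{source_country}")
    if theme:
        parts.append(f"theme:{theme}")
    if max_tone is not None:
        # Keep only articles whose tone is below the given threshold.
        parts.append(f"tone<{max_tone}")
    params = {
        "query": " ".join(parts),
        "mode": "artlist",
        "format": "json",
        "maxrecords": max_records,
    }
    if timespan:
        params["timespan"] = timespan
    # urlencode percent-escapes quotes, spaces, colons, and "<".
    return f"{DOC_API}?{urlencode(params)}"
```

Passing the phrase with its double quotes included, for example `build_doc_url('"Nous Research"', source_country="US", max_tone=-5)`, keeps the exact-phrase semantics intact after URL encoding.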

## 8. Legal & Licensing

- GDELT's terms of use describe its datasets as free for academic,
  commercial, and governmental use, provided the GDELT Project is cited
  and linked
- Article URLs link out to original publishers — copyright remains with
  the publisher
- GDELT is NOT a content archive; it's a metadata index

## 9. References

- DOC 2.0 API: https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
- GKG 2.0 (themes and entities): https://blog.gdeltproject.org/gkg-2-0-our-global-knowledge-graph-2-0-amazing-data-at-your-fingertips/
- Project home: https://www.gdeltproject.org/