feat(llm-wiki): port provenance markers, source hashing, and quality signals from llm-wiki-compiler (#13700)

Three additive conventions inspired by github.com/atomicmemory/llm-wiki-compiler:

- Paragraph-level provenance: `^[raw/articles/source.md]` markers on pages synthesizing 3+ sources, so readers can trace individual claims without re-reading full source files.
- Raw source content hashing: `sha256:` in raw/ frontmatter enables re-ingest drift detection — skip unchanged sources, flag changed ones.
- Optional `confidence` and `contested` frontmatter fields let lint surface weak or disputed claims without re-reading every page's prose.

Lint gains two new checks (quality signals, source drift) and one expanded check (contradictions now surfaces frontmatter-flagged pages).

Also adds a Related Tools section pointing users who want batch/scheduled compilation at llm-wiki-compiler (Obsidian-compatible, works on the same vault).

All additions are opt-in — existing wikis need no migration. Skill version 2.0.0 -> 2.1.0.
Teknium 2026-04-21 14:56:34 -07:00 committed by GitHub
parent 52cbceea44
commit 9fa49206dc

@@ -1,7 +1,7 @@
---
name: llm-wiki
description: "Karpathy's LLM Wiki — build and maintain a persistent, interlinked markdown knowledge base. Ingest sources, query compiled knowledge, and lint for consistency."
-version: 2.0.0
+version: 2.1.0
author: Hermes Agent
license: MIT
metadata:
@@ -122,6 +122,10 @@ Adapt to the user's domain. The schema constrains agent behavior and ensures con
- When updating a page, always bump the `updated` date
- Every new page must be added to `index.md` under the correct section
- Every action must be appended to `log.md`
- **Provenance markers:** On pages that synthesize 3+ sources, append `^[raw/articles/source-file.md]`
at the end of paragraphs whose claims come from a specific source. This lets a reader trace each
claim back without re-reading the whole raw file. Optional on single-source pages where the
`sources:` frontmatter is enough.
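
Illustrative sketch, not part of this commit: one way a lint pass could read the provenance markers described above. The function names and the blank-line paragraph split are assumptions; the only thing taken from the convention is the `^[raw/...]` marker shape.

```python
import re

# Matches a provenance marker such as ^[raw/articles/source-file.md]
MARKER_RE = re.compile(r"\^\[(raw/[^\]\s]+)\]")


def paragraph_sources(page_body: str) -> list[tuple[str, list[str]]]:
    """Return (paragraph, cited raw paths) for each non-empty paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", page_body) if p.strip()]
    return [(p, MARKER_RE.findall(p)) for p in paragraphs]


def untraced_paragraphs(page_body: str) -> list[str]:
    """Paragraphs with no provenance marker (headings are skipped)."""
    return [
        p for p, cited in paragraph_sources(page_body)
        if not cited and not p.startswith("#")
    ]
```

On a page that synthesizes 3+ sources, `untraced_paragraphs` would list the paragraphs a reader cannot yet trace to a specific raw file.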
## Frontmatter
```yaml
@@ -132,9 +136,33 @@ Adapt to the user's domain. The schema constrains agent behavior and ensures con
type: entity | concept | comparison | query | summary
tags: [from taxonomy below]
sources: [raw/articles/source-name.md]
# Optional quality signals:
confidence: high | medium | low # how well-supported the claims are
contested: true # set when the page has unresolved contradictions
contradictions: [other-page-slug] # pages this one conflicts with
---
```
`confidence` and `contested` are optional but recommended for opinion-heavy or fast-moving
topics. Lint surfaces `contested: true` and `confidence: low` pages for review so weak claims
don't silently harden into accepted wiki fact.
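
Illustrative sketch, not part of this commit: how lint might surface `contested: true` and `confidence: low` pages. It assumes frontmatter is a flat block of `key: value` lines and uses a deliberately tiny parser instead of a real YAML library; `parse_frontmatter` and `needs_review` are hypothetical names.

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse a flat `key: value` frontmatter block (not a full YAML parser)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if ":" in line and not line.lstrip().startswith("#"):
            key, _, value = line.partition(":")
            # Drop trailing inline comments such as "low # weak evidence".
            fields[key.strip()] = value.split(" #")[0].strip()
    return {}  # no closing delimiter: treat as missing frontmatter


def needs_review(text: str) -> list[str]:
    """Return the reasons a page should be surfaced by lint, if any."""
    fm = parse_frontmatter(text)
    reasons = []
    if fm.get("contested", "").lower() == "true":
        reasons.append("contested")
    if fm.get("confidence", "").lower() == "low":
        reasons.append("low confidence")
    return reasons
```

Pages without either field return an empty list, which matches the opt-in intent: existing pages are not flagged just for lacking the new fields.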
### raw/ Frontmatter
Raw sources ALSO get a small frontmatter block so re-ingests can detect drift:
```yaml
---
source_url: https://example.com/article # original URL, if applicable
ingested: YYYY-MM-DD
sha256: <hex digest of the raw content below the frontmatter>
---
```
The `sha256:` lets a future re-ingest of the same URL skip processing when content is unchanged,
and flag drift when it has changed. Compute over the body only (everything after the closing
`---`), not the frontmatter itself.
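
Illustrative sketch, not part of this commit: computing the digest over the body only. The helper names are hypothetical, and two details are assumptions the text above does not specify: the body is hashed as UTF-8, and it starts immediately after the closing `---` line with no whitespace normalization.

```python
import hashlib


def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a raw file into (frontmatter block, body after the closing ---)."""
    lines = text.splitlines(keepends=True)
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "".join(lines[: i + 1]), "".join(lines[i + 1 :])
    return "", text  # no frontmatter: the whole file is the body


def body_sha256(text: str) -> str:
    """Hex SHA-256 of the body only, so frontmatter edits never change it."""
    _, body = split_frontmatter(text)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()
```

Hashing the body alone is what makes the stored value stable: writing the digest into the frontmatter does not alter the content being hashed.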
## Tag Taxonomy
[Define 10-20 top-level tags for the domain. Add new tags here BEFORE using them.]
@@ -234,6 +262,10 @@ When the user provides a source (URL, file, paste), integrate it into the wiki:
- PDF → use `web_extract` (handles PDFs), save to `raw/papers/`
- Pasted text → save to appropriate `raw/` subdirectory
- Name the file descriptively: `raw/articles/karpathy-llm-wiki-2026.md`
- **Add raw frontmatter** (`source_url`, `ingested`, `sha256` of the body).
On re-ingest of the same URL: recompute the sha256, compare to the stored value —
skip if identical, flag drift and update if different. This is cheap enough to
do on every re-ingest and catches silent source changes.
**Discuss takeaways** with the user — what's interesting, what matters for
the domain. (Skip this in automated/cron contexts — proceed directly.)
@@ -250,6 +282,11 @@ When the user provides a source (URL, file, paste), integrate it into the wiki:
- **Cross-reference:** Every new or updated page must link to at least 2 other
pages via `[[wikilinks]]`. Check that existing pages link back.
- **Tags:** Only use tags from the taxonomy in SCHEMA.md
- **Provenance:** On pages synthesizing 3+ sources, append `^[raw/articles/source.md]`
markers to paragraphs whose claims trace to a specific source.
- **Confidence:** For opinion-heavy, fast-moving, or single-source claims, set
`confidence: medium` or `low` in frontmatter. Don't mark `high` unless the
claim is well-supported across multiple sources.
⑤ **Update navigation:**
- Add new pages to `index.md` under the correct section, alphabetically
@@ -304,18 +341,28 @@ wiki = "<WIKI_PATH>"
recent source that mentions the same entities.
**Contradictions:** Pages on the same topic with conflicting claims. Look for
-pages that share tags/entities but state different facts.
+pages that share tags/entities but state different facts. Surface all pages
+with `contested: true` or `contradictions:` frontmatter for user review.
**Quality signals:** List pages with `confidence: low` and any page that cites
only a single source but has no confidence field set — these are candidates
for either finding corroboration or demoting to `confidence: medium`.
**Source drift:** For each file in `raw/` with a `sha256:` frontmatter, recompute
the hash and flag mismatches. Mismatches indicate the raw file was edited
(shouldn't happen — raw/ is immutable) or ingested from a URL that has since
changed. Not a hard error, but worth reporting.
**Page size:** Flag pages over 200 lines — candidates for splitting.
**Tag audit:** List all tags in use, flag any not in the SCHEMA.md taxonomy.
**Log rotation:** If log.md exceeds 500 entries, rotate it.
**Report findings** with specific file paths and suggested actions, grouped by
-severity (broken links > orphans > stale content > style issues).
+severity (broken links > orphans > source drift > contested pages > stale content > style issues).
**Append to log.md:** `## [YYYY-MM-DD] lint | N issues found`
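
Illustrative sketch, not part of this commit: the source drift check from the lint list above, written as a standalone pass over `raw/`. The names `check_drift` and `find_drift`, the `*.md` glob, and the UTF-8 encoding are assumptions; files without a `sha256:` field are reported as unhashed rather than as drift, in keeping with the opt-in design.

```python
import hashlib
import re
from pathlib import Path

# A 64-hex-digit digest on its own frontmatter line.
SHA_RE = re.compile(r"^sha256:\s*([0-9a-fA-F]{64})\s*$", re.MULTILINE)


def check_drift(text: str) -> str:
    """Classify one raw file as 'ok', 'drift', or 'unhashed'."""
    parts = text.split("---\n", 2)
    if not text.startswith("---\n") or len(parts) < 3:
        return "unhashed"  # no frontmatter block at all
    frontmatter, body = parts[1], parts[2]
    match = SHA_RE.search(frontmatter)
    if match is None:
        return "unhashed"  # frontmatter present, but no stored digest
    actual = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return "ok" if actual == match.group(1).lower() else "drift"


def find_drift(raw_dir: Path) -> list[Path]:
    """Return raw files whose stored digest no longer matches their body."""
    return sorted(
        path for path in raw_dir.rglob("*.md")
        if check_drift(path.read_text(encoding="utf-8")) == "drift"
    )
```

Since the text treats a mismatch as reportable but not a hard error, a caller would list the returned paths in the lint report rather than abort.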
## Working with the Wiki
@@ -448,3 +495,12 @@ vault in Obsidian on your laptop/phone — changes appear within seconds.
The agent should check log size during lint.
- **Handle contradictions explicitly** — don't silently overwrite. Note both claims with dates,
mark in frontmatter, flag for user review.
## Related Tools
[llm-wiki-compiler](https://github.com/atomicmemory/llm-wiki-compiler) is a Node.js CLI that
compiles sources into a concept wiki with the same Karpathy inspiration. It's Obsidian-compatible,
so users who want a scheduled/CLI-driven compile pipeline can point it at the same vault this
skill maintains. Trade-offs: it owns page generation (replaces the agent's judgment on page
creation) and is tuned for small corpora. Use this skill when you want agent-in-the-loop curation;
use llmwiki when you want batch compile of a source directory.