docs: deep quality pass — expand 10 thin pages, fix specific issues (#4134)

Developer guide stubs expanded to full documentation:
- trajectory-format.md: 56→233 lines (JSONL format, ShareGPT example,
  normalization rules, reasoning markup, replay code)
- session-storage.md: 66→388 lines (SQLite schema, migration table,
  FTS5 search syntax, lineage queries, Python API examples)
- context-compression-and-caching.md: 72→321 lines (dual compression
  system, config defaults, 4-phase algorithm, before/after example,
  prompt caching mechanics, cache-aware patterns)
- tools-runtime.md: 65→246 lines (registry API, dispatch flow,
  availability checking, error wrapping, approval flow)
- prompt-assembly.md: 89→246 lines (concrete assembled prompt example,
  SOUL.md injection, context file discovery table)

User-facing pages expanded:
- docker.md: 62→224 lines (volumes, env forwarding, docker-compose,
  resource limits, troubleshooting)
- updating.md: 79→167 lines (update behavior, version checking,
  rollback instructions, Nix users)
- skins.md: 80→206 lines (all color/spinner/branding keys, built-in
  skin descriptions, full custom skin YAML template)

Hub pages improved:
- integrations/index.md: 25→82 lines (web search backends table,
  TTS/browser providers, quick config example)
- features/overview.md: added Integrations section with 6 missing links

Specific fixes:
- configuration.md: removed duplicate Gateway Streaming section
- mcp.md: removed internal "PR work" language
- plugins.md: added inline minimal plugin example (self-contained)

13 files changed, ~1700 lines added. Docusaurus build verified clean.
This commit is contained in:
Teknium 2026-03-30 20:30:11 -07:00 committed by GitHub
parent 54b876a5c9
commit 5b0243e6ad
13 changed files with 1735 additions and 174 deletions

@ -1,72 +1,321 @@
---
sidebar_position: 6
title: "Context Compression & Prompt Caching"
description: "How Hermes compresses long conversations and applies provider-side prompt caching"
---
# Context Compression & Prompt Caching

Hermes Agent uses a dual compression system and Anthropic prompt caching to
manage context window usage efficiently across long conversations.

Source files: `agent/context_compressor.py`, `agent/prompt_caching.py`,
`gateway/run.py` (session hygiene), `run_agent.py` (lines 1146-1204)

## Dual Compression System

Hermes has two separate compression layers that operate independently:
```
                     ┌──────────────────────────┐
Incoming message     │ Gateway Session Hygiene  │  Fires at 85% of context
──────────────────►  │ (pre-agent, rough est.)  │  Safety net for large sessions
                     └─────────────┬────────────┘
                                   ▼
                     ┌──────────────────────────┐
                     │ Agent ContextCompressor  │  Fires at 50% of context (default)
                     │ (in-loop, real tokens)   │  Normal context management
                     └──────────────────────────┘
```
### 1. Gateway Session Hygiene (85% threshold)

Located in `gateway/run.py` (around line 2220). This is a **safety net** that
runs before the agent processes a message. It prevents API failures when sessions
grow too large between turns (e.g., overnight accumulation in Telegram/Discord).

- **Threshold**: Fixed at 85% of model context length
- **Token source**: Prefers actual API-reported tokens from last turn; falls back
  to rough character-based estimate (`estimate_messages_tokens_rough`)
- **Fires**: Only when `len(history) >= 4` and compression is enabled
- **Purpose**: Catch sessions that escaped the agent's own compressor

The gateway hygiene threshold is intentionally higher than the agent's compressor.
Setting it at 50% (same as the agent) caused premature compression on every turn
in long gateway sessions.
### 2. Agent ContextCompressor (50% threshold, configurable)

Located in `agent/context_compressor.py`. This is the **primary compression
system** that runs inside the agent's tool loop with access to accurate,
API-reported token counts.

## Configuration

All compression settings are read from `config.yaml` under the `compression` key:
```yaml
compression:
  enabled: true         # Enable/disable compression (default: true)
  threshold: 0.50       # Fraction of context window (default: 0.50 = 50%)
  target_ratio: 0.20    # How much of threshold to keep as tail (default: 0.20)
  protect_last_n: 20    # Minimum protected tail messages (default: 20)
  summary_model: null   # Override model for summaries (default: uses auxiliary)
```

### Parameter Details

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `threshold` | `0.50` | 0.0-1.0 | Compression triggers when prompt tokens ≥ `threshold × context_length` |
| `target_ratio` | `0.20` | 0.10-0.80 | Controls tail protection token budget: `threshold_tokens × target_ratio` |
| `protect_last_n` | `20` | ≥1 | Minimum number of recent messages always preserved |
| `protect_first_n` | `3` | (hardcoded) | System prompt + first exchange always preserved |

### Computed Values (for a 200K context model at defaults)

```
context_length     = 200,000
threshold_tokens   = 200,000 × 0.50 = 100,000
tail_token_budget  = 100,000 × 0.20 = 20,000
max_summary_tokens = min(200,000 × 0.05, 12,000) = 10,000
```
## Compression Algorithm

The `ContextCompressor.compress()` method follows a 4-phase algorithm:

### Phase 1: Prune Old Tool Results (cheap, no LLM call)

Old tool results (>200 chars) outside the protected tail are replaced with:

```
[Old tool output cleared to save context space]
```

This is a cheap pre-pass that saves significant tokens from verbose tool
outputs (file contents, terminal output, search results).

### Phase 2: Determine Boundaries

```
┌─────────────────────────────────────────────────────────────┐
│ Message list                                                │
│                                                             │
│ [0..2]    ← protect_first_n (system + first exchange)       │
│ [3..N]    ← middle turns → SUMMARIZED                       │
│ [N..end]  ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
Tail protection is **token-budget based**: the compressor walks backward from
the end, accumulating tokens until the budget is exhausted, and falls back to
the fixed `protect_last_n` count if the budget would protect fewer messages.
Boundaries are aligned to avoid splitting tool_call/tool_result groups.
The `_align_boundary_backward()` method walks past consecutive tool results
to find the parent assistant message, keeping groups intact.
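The tail walk and boundary alignment can be sketched as follows (hypothetical helper names, with a crude character-count stand-in for the real API-reported token counts):

```python
def select_tail_start(messages, tail_token_budget, protect_last_n, count_tokens):
    """Walk backward from the end, accumulating tokens until the budget is
    exhausted; fall back to protect_last_n if the budget protects fewer."""
    total = 0
    start = len(messages)
    for i in range(len(messages) - 1, -1, -1):
        total += count_tokens(messages[i])
        if total > tail_token_budget:
            break
        start = i
    # Never protect fewer than protect_last_n messages
    return min(start, max(0, len(messages) - protect_last_n))


def align_boundary_backward(messages, idx):
    """Walk past consecutive tool results to the parent assistant message
    so a tool_call/tool_result group is never split at the boundary."""
    while idx > 0 and messages[idx].get("role") == "tool":
        idx -= 1
    return idx
```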
### Phase 3: Generate Structured Summary
The middle turns are summarized using the auxiliary LLM with a structured
template:
```
## Goal
[What the user is trying to accomplish]
## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]
## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]
## Key Decisions
[Important technical decisions and why]
## Relevant Files
[Files read, modified, or created — with brief note on each]
## Next Steps
[What needs to happen next]
## Critical Context
[Specific values, error messages, configuration details]
```
Summary budget scales with the amount of content being compressed:
- Formula: `content_tokens × 0.20` (the `_SUMMARY_RATIO` constant)
- Minimum: 2,000 tokens
- Maximum: `min(context_length × 0.05, 12,000)` tokens
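In code, the budget clamp looks roughly like this (a hypothetical helper built directly from the formulas above):

```python
_SUMMARY_RATIO = 0.20  # from the formula above

def summary_token_budget(content_tokens: int, context_length: int) -> int:
    """Scale the summary budget with the compressed content, clamped to
    [2,000, min(context_length * 0.05, 12,000)] tokens."""
    upper = min(int(context_length * 0.05), 12_000)
    return max(2_000, min(int(content_tokens * _SUMMARY_RATIO), upper))
```

For a 200K-context model compressing 80K tokens of middle turns, this clamps to the 10,000-token maximum shown in the Computed Values section.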
### Phase 4: Assemble Compressed Messages
The compressed message list is:
1. Head messages (with a note appended to system prompt on first compression)
2. Summary message (role chosen to avoid consecutive same-role violations)
3. Tail messages (unmodified)
Orphaned tool_call/tool_result pairs are cleaned up by `_sanitize_tool_pairs()`:
- Tool results referencing removed calls → removed
- Tool calls whose results were removed → stub result injected
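A simplified sketch of that cleanup (hypothetical message shapes; not the actual `_sanitize_tool_pairs()` implementation):

```python
def sanitize_tool_pairs(messages):
    surviving_call_ids = set()
    kept = []
    for msg in messages:
        if msg.get("role") == "assistant":
            surviving_call_ids.update(tc["id"] for tc in msg.get("tool_calls") or [])
            kept.append(msg)
        elif msg.get("role") == "tool":
            # Tool results referencing removed calls -> removed
            if msg.get("tool_call_id") in surviving_call_ids:
                kept.append(msg)
        else:
            kept.append(msg)

    answered = {m.get("tool_call_id") for m in kept if m.get("role") == "tool"}
    fixed = []
    for msg in kept:
        fixed.append(msg)
        if msg.get("role") != "assistant":
            continue
        for tc in msg.get("tool_calls") or []:
            # Tool calls whose results were removed -> stub result injected
            if tc["id"] not in answered:
                fixed.append({"role": "tool", "tool_call_id": tc["id"],
                              "content": "[Old tool output cleared to save context space]"})
    return fixed
```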
### Iterative Re-compression
On subsequent compressions, the previous summary is passed to the LLM with
instructions to **update** it rather than summarize from scratch. This preserves
information across multiple compactions — items move from "In Progress" to "Done",
new progress is added, and obsolete information is removed.
The `_previous_summary` field on the compressor instance stores the last summary
text for this purpose.
## Before/After Example
### Before Compression (45 messages, ~95K tokens)
```
[0] system: "You are a helpful assistant..." (system prompt)
[1] user: "Help me set up a FastAPI project"
[2] assistant: <tool_call> terminal: mkdir project </tool_call>
[3] tool: "directory created"
[4] assistant: <tool_call> write_file: main.py </tool_call>
[5] tool: "file written (2.3KB)"
... 30 more turns of file editing, testing, debugging ...
[38] assistant: <tool_call> terminal: pytest </tool_call>
[39] tool: "8 passed, 2 failed\n..." (5KB output)
[40] user: "Fix the failing tests"
[41] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[42] tool: "import pytest\n..." (3KB)
[43] assistant: "I see the issue with the test fixtures..."
[44] user: "Great, also add error handling"
```
### After Compression (25 messages, ~45K tokens)
```
[0] system: "You are a helpful assistant...
[Note: Some earlier conversation turns have been compacted...]"
[1] user: "Help me set up a FastAPI project"
[2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted...
## Goal
Set up a FastAPI project with tests and error handling
## Progress
### Done
- Created project structure: main.py, tests/, requirements.txt
- Implemented 5 API endpoints in main.py
- Wrote 10 test cases in tests/test_api.py
- 8/10 tests passing
### In Progress
- Fixing 2 failing tests (test_create_user, test_delete_user)
## Relevant Files
- main.py — FastAPI app with 5 endpoints
- tests/test_api.py — 10 test cases
- requirements.txt — fastapi, pytest, httpx
## Next Steps
- Fix failing test fixtures
- Add error handling"
[3] user: "Fix the failing tests"
[4] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[5] tool: "import pytest\n..."
[6] assistant: "I see the issue with the test fixtures..."
[7] user: "Great, also add error handling"
```
## Prompt Caching (Anthropic)
Source: `agent/prompt_caching.py`
Prompt caching reduces input token costs by ~75% on multi-turn conversations by
caching the conversation prefix. It uses Anthropic's `cache_control` breakpoints.
### Strategy: system_and_3
Anthropic allows a maximum of 4 `cache_control` breakpoints per request. Hermes
uses the "system_and_3" strategy:
```
Breakpoint 1: System prompt (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message ─┐
Breakpoint 3: 2nd-to-last non-system message ├─ Rolling window
Breakpoint 4: Last non-system message ─┘
```
### How It Works
`apply_anthropic_cache_control()` deep-copies the messages and injects
`cache_control` markers:
```python
# Cache marker format
marker = {"type": "ephemeral"}
# Or for 1-hour TTL:
marker = {"type": "ephemeral", "ttl": "1h"}
```
The marker is applied differently based on content type:
| Content Type | Where Marker Goes |
|-------------|-------------------|
| String content | Converted to `[{"type": "text", "text": ..., "cache_control": ...}]` |
| List content | Added to the last element's dict |
| None/empty | Added as `msg["cache_control"]` |
| Tool messages | Added as `msg["cache_control"]` (native Anthropic only) |
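The per-shape logic in the table might look like this (an illustrative sketch for one message, not the exact `apply_anthropic_cache_control()` code):

```python
import copy

def attach_cache_marker(msg: dict, ttl: str = "5m") -> dict:
    """Attach a cache_control marker to one message, per content shape."""
    marker = {"type": "ephemeral"} if ttl == "5m" else {"type": "ephemeral", "ttl": "1h"}
    msg = copy.deepcopy(msg)
    content = msg.get("content")
    if isinstance(content, str) and content:
        # String content -> wrap in a text block carrying the marker
        msg["content"] = [{"type": "text", "text": content, "cache_control": marker}]
    elif isinstance(content, list) and content:
        # List content -> marker goes on the last element
        content[-1]["cache_control"] = marker
    else:
        # None/empty (and tool messages on native Anthropic) -> message-level key
        msg["cache_control"] = marker
    return msg
```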
### Cache-Aware Design Patterns
1. **Stable system prompt**: The system prompt is breakpoint 1 and cached across
all turns. Avoid mutating it mid-conversation (compression appends a note
only on the first compaction).
2. **Message ordering matters**: Cache hits require prefix matching. Adding or
removing messages in the middle invalidates the cache for everything after.
3. **Compression cache interaction**: After compression, the cache is invalidated
for the compressed region but the system prompt cache survives. The rolling
3-message window re-establishes caching within 1-2 turns.
4. **TTL selection**: Default is `5m` (5 minutes). Use `1h` for long-running
sessions where the user takes breaks between turns.
### Enabling Prompt Caching
Prompt caching is automatically enabled when:
- The model is an Anthropic Claude model (detected by model name)
- The provider supports `cache_control` (native Anthropic API or OpenRouter)
```yaml
# config.yaml — TTL is configurable
model:
cache_ttl: "5m" # "5m" or "1h"
```
The CLI shows caching status at startup:
```
💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)
```
## Context Pressure Warnings
The agent emits context pressure warnings at 85% of the compression threshold
(not 85% of context — 85% of the threshold, which is itself 50% of context by default):
```
⚠️ Context is 85% to compaction threshold (42,500/50,000 tokens)
```
After compression, if usage drops below 85% of threshold, the warning state
is cleared. If compression fails to reduce below the warning level (the
conversation is too dense), the warning persists but compression won't
re-trigger until the threshold is exceeded again.
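The warning arithmetic, matching the example numbers above (a 100K-context model at the default 0.50 threshold; hypothetical helper):

```python
def context_pressure(prompt_tokens, context_length, threshold=0.50, warn_ratio=0.85):
    """Warning fires at 85% of the compaction threshold, not of the window."""
    threshold_tokens = int(context_length * threshold)
    warn_tokens = int(threshold_tokens * warn_ratio)
    return prompt_tokens >= warn_tokens, warn_tokens, threshold_tokens
```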


@ -41,6 +41,163 @@ The cached system prompt is assembled in roughly this order:
When `skip_context_files` is set (e.g., subagent delegation), SOUL.md is not loaded and the hardcoded `DEFAULT_AGENT_IDENTITY` is used instead.
### Concrete example: assembled system prompt
Here is a simplified view of what the final system prompt looks like when all layers are present (comments show the source of each section):
```
# Layer 1: Agent Identity (from ~/.hermes/SOUL.md)
You are Hermes, an AI assistant created by Nous Research.
You are an expert software engineer and researcher.
You value correctness, clarity, and efficiency.
...
# Layer 2: Tool-aware behavior guidance
You have persistent memory across sessions. Save durable facts using
the memory tool: user preferences, environment details, tool quirks,
and stable conventions. Memory is injected into every turn, so keep
it compact and focused on facts that will still matter later.
...
When the user references something from a past conversation or you
suspect relevant cross-session context exists, use session_search
to recall it before asking them to repeat themselves.
# Tool-use enforcement (for GPT/Codex models only)
You MUST use your tools to take action — do not describe what you
would do or plan to do without actually doing it.
...
# Layer 3: Honcho static block (when active)
[Honcho personality/context data]
# Layer 4: Optional system message (from config or API)
[User-configured system message override]
# Layer 5: Frozen MEMORY snapshot
## Persistent Memory
- User prefers Python 3.12, uses pyproject.toml
- Default editor is nvim
- Working on project "atlas" in ~/code/atlas
- Timezone: US/Pacific
# Layer 6: Frozen USER profile snapshot
## User Profile
- Name: Alice
- GitHub: alice-dev
# Layer 7: Skills index
## Skills (mandatory)
Before replying, scan the skills below. If one clearly matches
your task, load it with skill_view(name) and follow its instructions.
...
<available_skills>
software-development:
- code-review: Structured code review workflow
- test-driven-development: TDD methodology
research:
- arxiv: Search and summarize arXiv papers
</available_skills>
# Layer 8: Context files (from project directory)
# Project Context
The following project context files have been loaded and should be followed:
## AGENTS.md
This is the atlas project. Use pytest for testing. The main
entry point is src/atlas/main.py. Always run `make lint` before
committing.
# Layer 9: Timestamp + session
Current time: 2026-03-30T14:30:00-07:00
Session: abc123
# Layer 10: Platform hint
You are a CLI AI Agent. Try not to use markdown but simple text
renderable inside a terminal.
```
## How SOUL.md appears in the prompt
`SOUL.md` lives at `~/.hermes/SOUL.md` and serves as the agent's identity — the very first section of the system prompt. The loading logic in `prompt_builder.py` works as follows:
```python
# From agent/prompt_builder.py (simplified)
def load_soul_md() -> Optional[str]:
    soul_path = get_hermes_home() / "SOUL.md"
    if not soul_path.exists():
        return None
    content = soul_path.read_text(encoding="utf-8").strip()
    content = _scan_context_content(content, "SOUL.md")  # Security scan
    content = _truncate_content(content, "SOUL.md")      # Cap at 20k chars
    return content
```
When `load_soul_md()` returns content, it replaces the hardcoded `DEFAULT_AGENT_IDENTITY`. The `build_context_files_prompt()` function is then called with `skip_soul=True` to prevent SOUL.md from appearing twice (once as identity, once as a context file).
If `SOUL.md` doesn't exist, the system falls back to:
```
You are Hermes Agent, an intelligent AI assistant created by Nous Research.
You are helpful, knowledgeable, and direct. You assist users with a wide
range of tasks including answering questions, writing and editing code,
analyzing information, creative work, and executing actions via your tools.
You communicate clearly, admit uncertainty when appropriate, and prioritize
being genuinely useful over being verbose unless otherwise directed below.
Be targeted and efficient in your exploration and investigations.
```
## How context files are injected
`build_context_files_prompt()` uses a **priority system** — only one project context type is loaded (first match wins):
```python
# From agent/prompt_builder.py (simplified)
def build_context_files_prompt(cwd=None, skip_soul=False):
    cwd_path = Path(cwd).resolve()

    # Priority: first match wins — only ONE project context loaded
    project_context = (
        _load_hermes_md(cwd_path)       # 1. .hermes.md / HERMES.md (walks to git root)
        or _load_agents_md(cwd_path)    # 2. AGENTS.md (cwd only)
        or _load_claude_md(cwd_path)    # 3. CLAUDE.md (cwd only)
        or _load_cursorrules(cwd_path)  # 4. .cursorrules / .cursor/rules/*.mdc
    )

    sections = []
    if project_context:
        sections.append(project_context)

    # SOUL.md from HERMES_HOME (independent of project context)
    if not skip_soul:
        soul_content = load_soul_md()
        if soul_content:
            sections.append(soul_content)

    if not sections:
        return ""

    return (
        "# Project Context\n\n"
        "The following project context files have been loaded "
        "and should be followed:\n\n"
        + "\n".join(sections)
    )
```
### Context file discovery details
| Priority | Files | Search scope | Notes |
|----------|-------|-------------|-------|
| 1 | `.hermes.md`, `HERMES.md` | CWD up to git root | Hermes-native project config |
| 2 | `AGENTS.md` | CWD only | Common agent instruction file |
| 3 | `CLAUDE.md` | CWD only | Claude Code compatibility |
| 4 | `.cursorrules`, `.cursor/rules/*.mdc` | CWD only | Cursor compatibility |
All context files are:
- **Security scanned** — checked for prompt injection patterns (invisible unicode, "ignore previous instructions", credential exfiltration attempts)
- **Truncated** — capped at 20,000 characters using 70/20 head/tail ratio with a truncation marker
- **YAML frontmatter stripped** — `.hermes.md` frontmatter is removed (reserved for future config overrides)
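The 70/20 head/tail truncation can be sketched as follows (a hypothetical helper; the real `_truncate_content` may differ in marker text and details):

```python
def truncate_head_tail(text: str, cap: int = 20_000,
                       head_ratio: float = 0.70, tail_ratio: float = 0.20) -> str:
    """Keep the first 70% and last 20% of the cap, marker in between."""
    if len(text) <= cap:
        return text
    head = text[: int(cap * head_ratio)]
    tail = text[-int(cap * tail_ratio):]
    return head + "\n\n[... truncated ...]\n\n" + tail
```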
## API-call-time-only layers
These are intentionally *not* persisted as part of the cached system prompt:


@ -1,66 +1,388 @@
---
sidebar_position: 8
title: "Session Storage"
description: "How Hermes stores sessions in SQLite, maintains lineage, and exposes recall/search"
---
# Session Storage

Hermes Agent uses a SQLite database (`~/.hermes/state.db`) to persist session
metadata, full message history, and model configuration across CLI and gateway
sessions. This replaces the earlier per-session JSONL file approach.

Source file: `hermes_state.py`

## Architecture Overview

```
~/.hermes/state.db (SQLite, WAL mode)
├── sessions        — Session metadata, token counts, billing
├── messages        — Full message history per session
├── messages_fts    — FTS5 virtual table for full-text search
└── schema_version  — Single-row table tracking migration state
```

Key design decisions:

- **WAL mode** for concurrent readers + one writer (gateway multi-platform)
- **FTS5 virtual table** for fast text search across all session messages
- **Session lineage** via `parent_session_id` chains (compression-triggered splits)
- **Source tagging** (`cli`, `telegram`, `discord`, etc.) for platform filtering
- Batch runner and RL trajectories are NOT stored here (separate systems)
## SQLite Schema

### Sessions Table
```sql
CREATE TABLE IF NOT EXISTS sessions (
    id TEXT PRIMARY KEY,
    source TEXT NOT NULL,
    user_id TEXT,
    model TEXT,
    model_config TEXT,
    system_prompt TEXT,
    parent_session_id TEXT,
    started_at REAL NOT NULL,
    ended_at REAL,
    end_reason TEXT,
    message_count INTEGER DEFAULT 0,
    tool_call_count INTEGER DEFAULT 0,
    input_tokens INTEGER DEFAULT 0,
    output_tokens INTEGER DEFAULT 0,
    cache_read_tokens INTEGER DEFAULT 0,
    cache_write_tokens INTEGER DEFAULT 0,
    reasoning_tokens INTEGER DEFAULT 0,
    billing_provider TEXT,
    billing_base_url TEXT,
    billing_mode TEXT,
    estimated_cost_usd REAL,
    actual_cost_usd REAL,
    cost_status TEXT,
    cost_source TEXT,
    pricing_version TEXT,
    title TEXT,
    FOREIGN KEY (parent_session_id) REFERENCES sessions(id)
);

CREATE INDEX IF NOT EXISTS idx_sessions_source ON sessions(source);
CREATE INDEX IF NOT EXISTS idx_sessions_parent ON sessions(parent_session_id);
CREATE INDEX IF NOT EXISTS idx_sessions_started ON sessions(started_at DESC);
CREATE UNIQUE INDEX IF NOT EXISTS idx_sessions_title_unique
    ON sessions(title) WHERE title IS NOT NULL;
```
### Messages Table

```sql
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES sessions(id),
    role TEXT NOT NULL,
    content TEXT,
    tool_call_id TEXT,
    tool_calls TEXT,
    tool_name TEXT,
    timestamp REAL NOT NULL,
    token_count INTEGER,
    finish_reason TEXT,
    reasoning TEXT,
    reasoning_details TEXT,
    codex_reasoning_items TEXT
);

CREATE INDEX IF NOT EXISTS idx_messages_session ON messages(session_id, timestamp);
```

Notes:

- `tool_calls` is stored as a JSON string (serialized list of tool call objects)
- `reasoning_details` and `codex_reasoning_items` are stored as JSON strings
- `reasoning` stores the raw reasoning text for providers that expose it
- Timestamps are Unix epoch floats (`time.time()`)
### FTS5 Full-Text Search

```sql
CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts USING fts5(
    content,
    content=messages,
    content_rowid=id
);
```

The FTS5 table is kept in sync via three triggers that fire on INSERT, UPDATE,
and DELETE of the `messages` table:

```sql
CREATE TRIGGER IF NOT EXISTS messages_fts_insert AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;

CREATE TRIGGER IF NOT EXISTS messages_fts_delete AFTER DELETE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES('delete', old.id, old.content);
END;

CREATE TRIGGER IF NOT EXISTS messages_fts_update AFTER UPDATE ON messages BEGIN
    INSERT INTO messages_fts(messages_fts, rowid, content)
    VALUES('delete', old.id, old.content);
    INSERT INTO messages_fts(rowid, content) VALUES (new.id, new.content);
END;
```
## Schema Version and Migrations
Current schema version: **6**
The `schema_version` table stores a single integer. On initialization,
`_init_schema()` checks the current version and applies migrations sequentially:
| Version | Change |
|---------|--------|
| 1 | Initial schema (sessions, messages, FTS5) |
| 2 | Add `finish_reason` column to messages |
| 3 | Add `title` column to sessions |
| 4 | Add unique index on `title` (NULLs allowed, non-NULL must be unique) |
| 5 | Add billing columns: `cache_read_tokens`, `cache_write_tokens`, `reasoning_tokens`, `billing_provider`, `billing_base_url`, `billing_mode`, `estimated_cost_usd`, `actual_cost_usd`, `cost_status`, `cost_source`, `pricing_version` |
| 6 | Add reasoning columns to messages: `reasoning`, `reasoning_details`, `codex_reasoning_items` |
Each migration uses `ALTER TABLE ADD COLUMN` wrapped in try/except to handle
the column-already-exists case (idempotent). The version number is bumped after
each successful migration block.
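The sequential, idempotent pattern can be sketched like this (an illustrative subset of the migration table, not the real `_init_schema()` code):

```python
import sqlite3

def apply_migrations(conn: sqlite3.Connection) -> None:
    """Apply migrations sequentially; tolerate column-already-exists."""
    migrations = {
        2: ["ALTER TABLE messages ADD COLUMN finish_reason TEXT"],
        3: ["ALTER TABLE sessions ADD COLUMN title TEXT"],
    }
    version = conn.execute("SELECT version FROM schema_version").fetchone()[0]
    for target in sorted(migrations):
        if version >= target:
            continue
        for stmt in migrations[target]:
            try:
                conn.execute(stmt)
            except sqlite3.OperationalError as exc:
                if "duplicate column" not in str(exc):
                    raise  # only the already-exists case is tolerated
        conn.execute("UPDATE schema_version SET version = ?", (target,))
        version = target
    conn.commit()
```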
## Write Contention Handling
Multiple hermes processes (gateway + CLI sessions + worktree agents) share one
`state.db`. The `SessionDB` class handles write contention with:
- **Short SQLite timeout** (1 second) instead of the default 30s
- **Application-level retry** with random jitter (20-150ms, up to 15 retries)
- **BEGIN IMMEDIATE** transactions to surface lock contention at transaction start
- **Periodic WAL checkpoints** every 50 successful writes (PASSIVE mode)
This avoids the "convoy effect" where SQLite's deterministic internal backoff
causes all competing writers to retry at the same intervals.
```
_WRITE_MAX_RETRIES = 15
_WRITE_RETRY_MIN_S = 0.020 # 20ms
_WRITE_RETRY_MAX_S = 0.150 # 150ms
_CHECKPOINT_EVERY_N_WRITES = 50
```
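The retry shape looks roughly like this (a hypothetical helper using the constants above, assuming an autocommit connection; the real `SessionDB` logic differs in details):

```python
import random
import sqlite3
import time

def execute_with_retry(conn, sql, params=(),
                       max_retries=15, min_s=0.020, max_s=0.150):
    """Retry a write with random jitter so competing writers do not
    retry in lockstep (the 'convoy effect')."""
    for attempt in range(max_retries + 1):
        try:
            conn.execute("BEGIN IMMEDIATE")  # surface lock contention up front
            conn.execute(sql, params)
            conn.execute("COMMIT")
            return
        except sqlite3.OperationalError as exc:
            conn.rollback()
            if "locked" not in str(exc) or attempt == max_retries:
                raise
            time.sleep(random.uniform(min_s, max_s))
```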
## Common Operations
### Initialize
```python
from hermes_state import SessionDB
db = SessionDB() # Default: ~/.hermes/state.db
db = SessionDB(db_path=Path("/tmp/test.db")) # Custom path
```
### Create and Manage Sessions
```python
# Create a new session
db.create_session(
    session_id="sess_abc123",
    source="cli",
    model="anthropic/claude-sonnet-4.6",
    user_id="user_1",
    parent_session_id=None,  # or previous session ID for lineage
)
# End a session
db.end_session("sess_abc123", end_reason="user_exit")
# Reopen a session (clear ended_at/end_reason)
db.reopen_session("sess_abc123")
```
### Store Messages
```python
msg_id = db.append_message(
    session_id="sess_abc123",
    role="assistant",
    content="Here's the answer...",
    tool_calls=[{"id": "call_1", "function": {"name": "terminal", "arguments": "{}"}}],
    token_count=150,
    finish_reason="stop",
    reasoning="Let me think about this...",
)
```
### Retrieve Messages
```python
# Raw messages with all metadata
messages = db.get_messages("sess_abc123")
# OpenAI conversation format (for API replay)
conversation = db.get_messages_as_conversation("sess_abc123")
# Returns: [{"role": "user", "content": "..."}, {"role": "assistant", ...}]
```
### Session Titles
```python
# Set a title (must be unique among non-NULL titles)
db.set_session_title("sess_abc123", "Fix Docker Build")
# Resolve by title (returns most recent in lineage)
session_id = db.resolve_session_by_title("Fix Docker Build")
# Auto-generate next title in lineage
next_title = db.get_next_title_in_lineage("Fix Docker Build")
# Returns: "Fix Docker Build #2"
```
## Full-Text Search
The `search_messages()` method supports FTS5 query syntax with automatic
sanitization of user input.
### Basic Search
```python
results = db.search_messages("docker deployment")
```
### FTS5 Query Syntax
| Syntax | Example | Meaning |
|--------|---------|---------|
| Keywords | `docker deployment` | Both terms (implicit AND) |
| Quoted phrase | `"exact phrase"` | Exact phrase match |
| Boolean OR | `docker OR kubernetes` | Either term |
| Boolean NOT | `python NOT java` | Exclude term |
| Prefix | `deploy*` | Prefix match |
### Filtered Search
```python
# Search only CLI sessions
results = db.search_messages("error", source_filter=["cli"])
# Exclude gateway sessions
results = db.search_messages("bug", exclude_sources=["telegram", "discord"])
# Search only user messages
results = db.search_messages("help", role_filter=["user"])
```
### Search Results Format
Each result includes:
- `id`, `session_id`, `role`, `timestamp`
- `snippet` — FTS5-generated snippet with `>>>match<<<` markers
- `context` — 1 message before and after the match (content truncated to 200 chars)
- `source`, `model`, `session_started` — from the parent session
The `_sanitize_fts5_query()` method handles edge cases:
- Strips unmatched quotes and special characters
- Wraps hyphenated terms in quotes (`chat-send` → `"chat-send"`)
- Removes dangling boolean operators (`hello AND` → `hello`)
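Those rules might be sketched like this (a hypothetical simplification of `_sanitize_fts5_query()`):

```python
def sanitize_fts5_query(query: str) -> str:
    """Make user input safe for FTS5 MATCH syntax."""
    # Strip unmatched double quotes
    if query.count('"') % 2 == 1:
        query = query.replace('"', "")
    tokens = query.split()
    # Remove dangling boolean operators at either end
    while tokens and tokens[0] in ("AND", "OR", "NOT"):
        tokens.pop(0)
    while tokens and tokens[-1] in ("AND", "OR", "NOT"):
        tokens.pop()
    # Quote hyphenated terms so FTS5 does not parse '-' as syntax
    tokens = [f'"{t}"' if "-" in t and not t.startswith('"') else t
              for t in tokens]
    return " ".join(tokens)
```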
## Session Lineage
Sessions can form chains via `parent_session_id`. This happens when context
compression triggers a session split in the gateway.
### Query: Find Session Lineage
```sql
-- Find all ancestors of a session
WITH RECURSIVE lineage AS (
    SELECT * FROM sessions WHERE id = ?
    UNION ALL
    SELECT s.* FROM sessions s
    JOIN lineage l ON s.id = l.parent_session_id
)
SELECT id, title, started_at, parent_session_id FROM lineage;

-- Find all descendants of a session
WITH RECURSIVE descendants AS (
    SELECT * FROM sessions WHERE id = ?
    UNION ALL
    SELECT s.* FROM sessions s
    JOIN descendants d ON s.parent_session_id = d.id
)
SELECT id, title, started_at FROM descendants;
```
### Query: Recent Sessions with Preview
```sql
SELECT s.*,
       COALESCE(
           (SELECT SUBSTR(m.content, 1, 63)
            FROM messages m
            WHERE m.session_id = s.id AND m.role = 'user' AND m.content IS NOT NULL
            ORDER BY m.timestamp, m.id LIMIT 1),
           ''
       ) AS preview,
       COALESCE(
           (SELECT MAX(m2.timestamp) FROM messages m2 WHERE m2.session_id = s.id),
           s.started_at
       ) AS last_active
FROM sessions s
ORDER BY s.started_at DESC
LIMIT 20;
```
### Query: Token Usage Statistics
```sql
-- Total tokens by model
SELECT model,
COUNT(*) as session_count,
SUM(input_tokens) as total_input,
SUM(output_tokens) as total_output,
SUM(estimated_cost_usd) as total_cost
FROM sessions
WHERE model IS NOT NULL
GROUP BY model
ORDER BY total_cost DESC;
-- Sessions with highest token usage
SELECT id, title, model, input_tokens + output_tokens AS total_tokens,
estimated_cost_usd
FROM sessions
ORDER BY total_tokens DESC
LIMIT 10;
```
## Export and Cleanup
```python
# Export a single session with messages
data = db.export_session("sess_abc123")
# Export all sessions (with messages) as list of dicts
all_data = db.export_all(source="cli")
# Delete old sessions (only ended sessions)
deleted_count = db.prune_sessions(older_than_days=90)
deleted_count = db.prune_sessions(older_than_days=30, source="telegram")
# Clear messages but keep the session record
db.clear_messages("sess_abc123")
# Delete session and all messages
db.delete_session("sess_abc123")
```
## Database Location
Default path: `~/.hermes/state.db`
This is derived from `hermes_constants.get_hermes_home()` which resolves to
`~/.hermes/` by default, or the value of `HERMES_HOME` environment variable.
The database file, WAL file (`state.db-wal`), and shared-memory file
(`state.db-shm`) are all created in the same directory.
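A small helper for resolving that path outside of Hermes, following the same rules (an illustrative stand-in for `hermes_constants.get_hermes_home()`, not its actual code):

```python
import os
from pathlib import Path

def state_db_path() -> Path:
    """Resolve the session DB path: $HERMES_HOME if set, else ~/.hermes (sketch)."""
    home = os.environ.get("HERMES_HOME")
    base = Path(home).expanduser() if home else Path.home() / ".hermes"
    return base / "state.db"
```

Opening the database read-only (for ad-hoc queries while Hermes is running) is safest via a URI connection such as `sqlite3.connect(f"file:{state_db_path()}?mode=ro", uri=True)`, since WAL mode allows concurrent readers.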
@@ -22,6 +22,89 @@ Each tool module calls `registry.register(...)` at import time.
`model_tools.py` is responsible for importing/discovering tool modules and building the schema list used by the model.
### How `registry.register()` works
Every tool file in `tools/` calls `registry.register()` at module level to declare itself. The function signature is:
```python
registry.register(
name="terminal", # Unique tool name (used in API schemas)
toolset="terminal", # Toolset this tool belongs to
schema={...}, # OpenAI function-calling schema (description, parameters)
handler=handle_terminal, # The function that executes when the tool is called
check_fn=check_terminal, # Optional: returns True/False for availability
requires_env=["SOME_VAR"], # Optional: env vars needed (for UI display)
is_async=False, # Whether the handler is an async coroutine
description="Run commands", # Human-readable description
emoji="💻", # Emoji for spinner/progress display
)
```
Each call creates a `ToolEntry` stored in the singleton `ToolRegistry._tools` dict keyed by tool name. If a name collision occurs across toolsets, a warning is logged and the later registration wins.
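A stripped-down sketch of the registry side of this contract, showing the last-wins collision behavior (illustrative; the real `ToolRegistry` and `ToolEntry` carry more fields and logic):

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("registry")

@dataclass
class ToolEntry:
    name: str
    toolset: str
    schema: dict
    handler: Callable
    check_fn: Optional[Callable] = None
    is_async: bool = False

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolEntry] = {}

    def register(self, **kwargs) -> None:
        entry = ToolEntry(**kwargs)
        if entry.name in self._tools:
            # Name collision across toolsets: warn, then let the later win
            logger.warning("tool %r re-registered; overwriting", entry.name)
        self._tools[entry.name] = entry
```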
### Discovery: `_discover_tools()`
When `model_tools.py` is imported, it calls `_discover_tools()` which imports every tool module in order:
```python
_modules = [
"tools.web_tools",
"tools.terminal_tool",
"tools.file_tools",
"tools.vision_tools",
"tools.mixture_of_agents_tool",
"tools.image_generation_tool",
"tools.skills_tool",
"tools.browser_tool",
"tools.cronjob_tools",
"tools.rl_training_tool",
"tools.tts_tool",
"tools.todo_tool",
"tools.memory_tool",
"tools.session_search_tool",
"tools.clarify_tool",
"tools.code_execution_tool",
"tools.delegate_tool",
"tools.process_registry",
"tools.send_message_tool",
"tools.honcho_tools",
"tools.homeassistant_tool",
]
```
Each import triggers the module's `registry.register()` calls. Errors in optional tools (e.g., missing `fal_client` for image generation) are caught and logged — they don't prevent other tools from loading.
After core tool discovery, MCP tools and plugin tools are also discovered:
1. **MCP tools**`tools.mcp_tool.discover_mcp_tools()` reads MCP server config and registers tools from external servers.
2. **Plugin tools**`hermes_cli.plugins.discover_plugins()` loads user/project/pip plugins that may register additional tools.
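The import loop itself can be sketched as an error-tolerant pass (a hypothetical helper; the real `_discover_tools()` also wires in the MCP and plugin discovery steps above):

```python
import importlib
import logging

logger = logging.getLogger("model_tools")

def discover_tools(module_names: list[str]):
    """Import each tool module; registration happens as a side effect of import.
    Failures in optional modules are logged and skipped, not fatal (sketch)."""
    loaded, failed = [], {}
    for name in module_names:
        try:
            importlib.import_module(name)
            loaded.append(name)
        except Exception as exc:  # e.g. a missing optional dependency
            failed[name] = exc
            logger.warning("skipping tool module %s: %s", name, exc)
    return loaded, failed
```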
## Tool availability checking (`check_fn`)
Each tool can optionally provide a `check_fn` — a callable that returns `True` when the tool is available and `False` otherwise. Typical checks include:
- **API key present** — e.g., `lambda: bool(os.environ.get("SERP_API_KEY"))` for web search
- **Service running** — e.g., checking if the Honcho server is configured
- **Binary installed** — e.g., verifying `playwright` is available for browser tools
When `registry.get_definitions()` builds the schema list for the model, it runs each tool's `check_fn()`:
```python
# Simplified from registry.py
if entry.check_fn:
try:
available = bool(entry.check_fn())
except Exception:
available = False # Exceptions = unavailable
if not available:
continue # Skip this tool entirely
```
Key behaviors:
- Check results are **cached per-call** — if multiple tools share the same `check_fn`, it only runs once.
- Exceptions in `check_fn()` are treated as "unavailable" (fail-safe).
- The `is_toolset_available()` method checks whether a toolset's `check_fn` passes, used for UI display and toolset resolution.
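Putting those behaviors together, a simplified version of the filtering pass might look like this (the entry shape and names are illustrative):

```python
def get_definitions(entries: list[dict]) -> list[dict]:
    """Filter tool entries by availability, caching results per check_fn
    so a shared check only runs once per call (sketch of the behavior above)."""
    cache: dict[int, bool] = {}
    defs = []
    for entry in entries:
        check = entry.get("check_fn")
        if check is not None:
            key = id(check)
            if key not in cache:
                try:
                    cache[key] = bool(check())
                except Exception:
                    cache[key] = False  # exceptions mean "unavailable"
            if not cache[key]:
                continue  # skip this tool entirely
        defs.append(entry["schema"])
    return defs
```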
## Toolset resolution
Toolsets are named bundles of tools. Hermes resolves them through:
@@ -31,10 +114,108 @@ Toolsets are named bundles of tools. Hermes resolves them through:
- dynamic MCP toolsets
- curated special-purpose sets like `hermes-acp`
### How `get_tool_definitions()` filters tools
The main entry point is `model_tools.get_tool_definitions(enabled_toolsets, disabled_toolsets, quiet_mode)`:
1. **If `enabled_toolsets` is provided** — only tools from those toolsets are included. Each toolset name is resolved via `resolve_toolset()` which expands composite toolsets into individual tool names.
2. **If `disabled_toolsets` is provided** — start with ALL toolsets, then subtract the disabled ones.
3. **If neither** — include all known toolsets.
4. **Registry filtering** — the resolved tool name set is passed to `registry.get_definitions()`, which applies `check_fn` filtering and returns OpenAI-format schemas.
5. **Dynamic schema patching** — after filtering, `execute_code` and `browser_navigate` schemas are dynamically adjusted to only reference tools that actually passed filtering (prevents model hallucination of unavailable tools).
### Legacy toolset names
Old toolset names with `_tools` suffixes (e.g., `web_tools`, `terminal_tools`) are mapped to their modern tool names via `_LEGACY_TOOLSET_MAP` for backward compatibility.
## Dispatch
At runtime, tools are dispatched through the central registry, except for a handful of agent-level tools (todo, memory, session search, task delegation) that the agent loop handles directly.
### Dispatch flow: model tool_call → handler execution
When the model returns a `tool_call`, the flow is:
```
Model response with tool_call
run_agent.py agent loop
model_tools.handle_function_call(name, args, task_id, user_task)
[Agent-loop tools?] → handled directly by agent loop (todo, memory, session_search, delegate_task)
[Plugin pre-hook] → invoke_hook("pre_tool_call", ...)
registry.dispatch(name, args, **kwargs)
Look up ToolEntry by name
[Async handler?] → bridge via _run_async()
[Sync handler?] → call directly
Return result string (or JSON error)
[Plugin post-hook] → invoke_hook("post_tool_call", ...)
```
### Error wrapping
All tool execution is wrapped in error handling at two levels:
1. **`registry.dispatch()`** — catches any exception from the handler and returns `{"error": "Tool execution failed: ExceptionType: message"}` as JSON.
2. **`handle_function_call()`** — wraps the entire dispatch in a secondary try/except that returns `{"error": "Error executing tool_name: message"}`.
This ensures the model always receives a well-formed JSON string, never an unhandled exception.
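A minimal sketch of the two layers (hypothetical signatures; the real functions take additional parameters such as `task_id`):

```python
import json

def dispatch(handlers: dict, name: str, args: dict) -> str:
    """Inner layer: run the handler, convert any exception into a JSON error."""
    try:
        return str(handlers[name](**args))
    except Exception as exc:
        return json.dumps(
            {"error": f"Tool execution failed: {type(exc).__name__}: {exc}"}
        )

def handle_function_call(handlers: dict, name: str, args: dict) -> str:
    """Outer layer: guard the whole dispatch so the model always gets JSON back."""
    try:
        return dispatch(handlers, name, args)
    except Exception as exc:
        return json.dumps({"error": f"Error executing {name}: {exc}"})
```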
### Agent-loop tools
Four tools are intercepted before registry dispatch because they need agent-level state (TodoStore, MemoryStore, etc.):
- `todo` — planning/task tracking
- `memory` — persistent memory writes
- `session_search` — cross-session recall
- `delegate_task` — spawns subagent sessions
These tools' schemas are still registered in the registry (for `get_tool_definitions`), but their handlers return a stub error if dispatch somehow reaches them directly.
### Async bridging
When a tool handler is async, `_run_async()` bridges it to the sync dispatch path:
- **CLI path (no running loop)** — uses a persistent event loop to keep cached async clients alive
- **Gateway path (running loop)** — spins up a disposable thread with `asyncio.run()`
- **Worker threads (parallel tools)** — uses per-thread persistent loops stored in thread-local storage
## The DANGEROUS_PATTERNS approval flow
The terminal tool integrates a dangerous-command approval system defined in `tools/approval.py`:
1. **Pattern detection**`DANGEROUS_PATTERNS` is a list of `(regex, description)` tuples covering destructive operations:
- Recursive deletes (`rm -rf`)
- Filesystem formatting (`mkfs`, `dd`)
- SQL destructive operations (`DROP TABLE`, `DELETE FROM` without `WHERE`)
- System config overwrites (`> /etc/`)
- Service manipulation (`systemctl stop`)
- Remote code execution (`curl | sh`)
- Fork bombs, process kills, etc.
2. **Detection** — before executing any terminal command, `detect_dangerous_command(command)` checks against all patterns.
3. **Approval prompt** — if a match is found:
- **CLI mode** — an interactive prompt asks the user to approve, deny, or allow permanently
- **Gateway mode** — an async approval callback sends the request to the messaging platform
- **Smart approval** — optionally, an auxiliary LLM can auto-approve low-risk commands that match patterns (e.g., `rm -rf node_modules/` is safe but matches "recursive delete")
4. **Session state** — approvals are tracked per-session. Once you approve "recursive delete" for a session, subsequent `rm -rf` commands don't re-prompt.
5. **Permanent allowlist** — the "allow permanently" option writes the pattern to `config.yaml`'s `command_allowlist`, persisting across sessions.
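A toy version of the detection step, with a small illustrative subset of patterns (the real `DANGEROUS_PATTERNS` list in `tools/approval.py` is much longer):

```python
import re

# Illustrative subset; each entry is (compiled regex, description)
DANGEROUS_PATTERNS = [
    (re.compile(r"\brm\s+(-[a-z]*r[a-z]*f|-[a-z]*f[a-z]*r)\b", re.I), "recursive delete"),
    (re.compile(r"\bmkfs(\.\w+)?\b", re.I), "filesystem format"),
    (re.compile(r"\bDROP\s+TABLE\b", re.I), "SQL drop table"),
    (re.compile(r"curl[^|]*\|\s*(ba)?sh\b", re.I), "remote code execution"),
]

def detect_dangerous_command(command: str):
    """Return the description of the first matching pattern, or None."""
    for pattern, description in DANGEROUS_PATTERNS:
        if pattern.search(command):
            return description
    return None
```

Note that the match is purely textual, which is exactly why the smart-approval layer exists: `rm -rf node_modules/` trips the same pattern as `rm -rf /`.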
## Terminal/runtime environments
The terminal system supports multiple backends:
@@ -1,56 +1,233 @@
---
sidebar_position: 10
title: "Trajectories & Training Format"
description: "How Hermes saves trajectories, normalizes tool calls, and produces training-friendly outputs"
---
# Trajectories & Training Format
Hermes Agent saves conversation trajectories in ShareGPT-compatible JSONL format
for use as training data, debugging artifacts, and reinforcement learning datasets.
Source files: `agent/trajectory.py`, `run_agent.py` (lines 1788-1975), `batch_runner.py`
## File Naming Convention
Trajectories are written to files in the current working directory:
| File | When |
|------|------|
| `trajectory_samples.jsonl` | Conversations that completed successfully (`completed=True`) |
| `failed_trajectories.jsonl` | Conversations that failed or were interrupted (`completed=False`) |
The batch runner (`batch_runner.py`) writes to a custom output file per batch
(e.g., `batch_001_output.jsonl`) with additional metadata fields.
You can override the filename via the `filename` parameter in `save_trajectory()`.
## JSONL Entry Format
Each line in the file is a self-contained JSON object. There are two variants:
### CLI/Interactive Format (from `_save_trajectory`)
```json
{
  "conversations": [ ... ],
  "timestamp": "2026-03-30T14:22:31.456789",
  "model": "anthropic/claude-sonnet-4.6",
  "completed": true
}
```
### Batch Runner Format (from `batch_runner.py`)
```json
{
  "prompt_index": 42,
  "conversations": [ ... ],
  "metadata": { "prompt_source": "gsm8k", "difficulty": "hard" },
  "completed": true,
  "partial": false,
  "api_calls": 7,
  "toolsets_used": ["code_tools", "file_tools"],
  "tool_stats": {
    "terminal": {"count": 3, "success": 3, "failure": 0},
    "read_file": {"count": 2, "success": 2, "failure": 0},
    "write_file": {"count": 0, "success": 0, "failure": 0}
  },
  "tool_error_counts": {
    "terminal": 0,
    "read_file": 0,
    "write_file": 0
  }
}
```
The `tool_stats` and `tool_error_counts` dictionaries are normalized to include
ALL possible tools (from `model_tools.TOOL_TO_TOOLSET_MAP`) with zero defaults,
ensuring consistent schema across entries for HuggingFace dataset loading.
## Conversations Array (ShareGPT Format)
The `conversations` array uses ShareGPT role conventions:
| API Role | ShareGPT `from` |
|----------|-----------------|
| system | `"system"` |
| user | `"human"` |
| assistant | `"gpt"` |
| tool | `"tool"` |
### Complete Example
```json
{
"conversations": [
{
"from": "system",
"value": "You are a function calling AI model. You are provided with function signatures within <tools> </tools> XML tags. You may call one or more functions to assist with the user query. If available tools are not relevant in assisting with user query, just respond in natural conversational language. Don't make assumptions about what values to plug into functions. After calling & executing the functions, you will be provided with function results within <tool_response> </tool_response> XML tags. Here are the available tools:\n<tools>\n[{\"name\": \"terminal\", \"description\": \"Execute shell commands\", \"parameters\": {\"type\": \"object\", \"properties\": {\"command\": {\"type\": \"string\"}}}, \"required\": null}]\n</tools>\nFor each function call return a JSON object, with the following pydantic model json schema for each:\n{'title': 'FunctionCall', 'type': 'object', 'properties': {'name': {'title': 'Name', 'type': 'string'}, 'arguments': {'title': 'Arguments', 'type': 'object'}}, 'required': ['name', 'arguments']}\nEach function call should be enclosed within <tool_call> </tool_call> XML tags.\nExample:\n<tool_call>\n{'name': <function-name>,'arguments': <args-dict>}\n</tool_call>"
},
{
"from": "human",
"value": "What Python version is installed?"
},
{
"from": "gpt",
"value": "<think>\nThe user wants to know the Python version. I should run python3 --version.\n</think>\n<tool_call>\n{\"name\": \"terminal\", \"arguments\": {\"command\": \"python3 --version\"}}\n</tool_call>"
},
{
"from": "tool",
"value": "<tool_response>\n{\"tool_call_id\": \"call_abc123\", \"name\": \"terminal\", \"content\": \"Python 3.11.6\"}\n</tool_response>"
},
{
"from": "gpt",
"value": "<think>\nGot the version. I can now answer the user.\n</think>\nPython 3.11.6 is installed on this system."
}
],
"timestamp": "2026-03-30T14:22:31.456789",
"model": "anthropic/claude-sonnet-4.6",
"completed": true
}
```
## Normalization Rules
### Reasoning Content Markup
The trajectory converter normalizes ALL reasoning into `<think>` tags, regardless
of how the model originally produced it:
1. **Native thinking tokens** (`msg["reasoning"]` field from providers like
Anthropic, OpenAI o-series): Wrapped as `<think>\n{reasoning}\n</think>\n`
and prepended before the content.
2. **REASONING_SCRATCHPAD XML** (when native thinking is disabled and the model
reasons via system-prompt-instructed XML): `<REASONING_SCRATCHPAD>` tags are
converted to `<think>` via `convert_scratchpad_to_think()`.
3. **Empty think blocks**: Every `gpt` turn is guaranteed to have a `<think>`
block. If no reasoning was produced, an empty block is inserted:
`<think>\n</think>\n` — this ensures consistent format for training data.
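The three rules combine into something like this sketch (field names follow the docs; the real converter handles more edge cases):

```python
import re

def normalize_reasoning(msg: dict) -> str:
    """Produce the <think>-normalized text for one assistant turn (sketch)."""
    content = msg.get("content") or ""
    # Rule 2: scratchpad XML becomes <think> tags
    content = re.sub(
        r"</?REASONING_SCRATCHPAD>",
        lambda m: "</think>" if m.group(0).startswith("</") else "<think>",
        content,
    )
    # Rule 1: native reasoning tokens are wrapped and prepended
    if msg.get("reasoning"):
        return f"<think>\n{msg['reasoning']}\n</think>\n{content}"
    # Rule 3: guarantee a (possibly empty) think block on every gpt turn
    if "<think>" not in content:
        return f"<think>\n</think>\n{content}"
    return content
```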
### Tool Call Normalization
Tool calls from the API format (with `tool_call_id`, function name, arguments as
JSON string) are converted to XML-wrapped JSON:
```
<tool_call>
{"name": "terminal", "arguments": {"command": "ls -la"}}
</tool_call>
```
- Arguments are parsed from JSON strings back to objects (not double-encoded)
- If JSON parsing fails (shouldn't happen — validated during conversation),
an empty `{}` is used with a warning logged
- Multiple tool calls in one assistant turn produce multiple `<tool_call>` blocks
in a single `gpt` message
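As a sketch, the conversion for one assistant turn's `tool_calls` array might look like this (assuming the OpenAI-style `{"function": {"name", "arguments"}}` shape):

```python
import json
import logging

logger = logging.getLogger("trajectory")

def tool_calls_to_xml(tool_calls: list[dict]) -> str:
    """Render API-format tool_calls as <tool_call> blocks (illustrative)."""
    blocks = []
    for call in tool_calls:
        fn = call["function"]
        try:
            # Arguments arrive as a JSON string; parse so they aren't double-encoded
            args = json.loads(fn["arguments"])
        except (json.JSONDecodeError, TypeError):
            logger.warning("unparseable arguments for %s", fn["name"])
            args = {}
        payload = json.dumps({"name": fn["name"], "arguments": args})
        blocks.append(f"<tool_call>\n{payload}\n</tool_call>")
    return "\n".join(blocks)
```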
### Tool Response Normalization
All tool results following an assistant message are grouped into a single `tool`
turn with XML-wrapped JSON responses:
```
<tool_response>
{"tool_call_id": "call_abc123", "name": "terminal", "content": "output here"}
</tool_response>
```
- If tool content looks like JSON (starts with `{` or `[`), it's parsed so the
content field contains a JSON object/array rather than a string
- Multiple tool results are joined with newlines in one message
- The tool name is matched by position against the parent assistant's `tool_calls`
array
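A sketch of the grouping step (the result shape is illustrative; the real code matches names positionally against the parent turn's `tool_calls`):

```python
import json

def tool_results_to_turn(results: list[dict]) -> dict:
    """Group consecutive tool results into one ShareGPT 'tool' turn (sketch).
    Each result dict: {"tool_call_id", "name", "content"}."""
    parts = []
    for r in results:
        content = r["content"]
        # Parse JSON-looking content so it nests as an object, not a string
        if isinstance(content, str) and content.lstrip()[:1] in ("{", "["):
            try:
                content = json.loads(content)
            except json.JSONDecodeError:
                pass
        body = json.dumps(
            {"tool_call_id": r["tool_call_id"], "name": r["name"], "content": content}
        )
        parts.append(f"<tool_response>\n{body}\n</tool_response>")
    return {"from": "tool", "value": "\n".join(parts)}
```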
### System Message
The system message is generated at save time (not taken from the conversation).
It follows the Hermes function-calling prompt template with:
- Preamble explaining the function-calling protocol
- `<tools>` XML block containing the JSON tool definitions
- Schema reference for `FunctionCall` objects
- `<tool_call>` example
Tool definitions include `name`, `description`, `parameters`, and `required`
(set to `null` to match the canonical format).
## Loading Trajectories
Trajectories are standard JSONL — load with any JSON-lines reader:
```python
import json
def load_trajectories(path: str):
"""Load trajectory entries from a JSONL file."""
entries = []
with open(path, "r", encoding="utf-8") as f:
for line in f:
line = line.strip()
if line:
entries.append(json.loads(line))
return entries
# Filter to successful completions only
successful = [e for e in load_trajectories("trajectory_samples.jsonl")
if e.get("completed")]
# Extract just the conversations for training
training_data = [e["conversations"] for e in successful]
```
### Loading for HuggingFace Datasets
```python
from datasets import load_dataset
ds = load_dataset("json", data_files="trajectory_samples.jsonl")
```
The normalized `tool_stats` schema ensures all entries have the same columns,
preventing Arrow schema mismatch errors during dataset loading.
## Controlling Trajectory Saving
In the CLI, trajectory saving is controlled by:
```yaml
# config.yaml
agent:
save_trajectories: true # default: false
```
Or via the `--save-trajectories` flag. When the agent initializes with
`save_trajectories=True`, the `_save_trajectory()` method is called at the end
of each conversation turn.
The batch runner always saves trajectories (that's its primary purpose).
Samples with zero reasoning across all turns are automatically discarded by the
batch runner to avoid polluting training data with non-reasoning examples.