* docs: fix ascii-guard border alignment errors
Three docs pages had ASCII diagram boxes with off-by-one column
alignment issues that failed docs-site-checks CI:
- architecture.md: outer box is 71 cols but inner-box content lines
and border corners were offset by 1 col, making content-line right
border at col 70/72 while top/bottom border was at col 71. Inner
boxes also had border corners at cols 19/36/53 but content pipes
at cols 20/37/54. Rewrote the diagram with consistent 71-col width
throughout, aligned inner boxes at cols 4-19, 22-37, 40-55 with
2-space gaps and 15-space trailing padding.
- gateway-internals.md: same class of issue — outer box at 51 cols,
inner content lines varied 52-54 cols. Rewrote with consistent
51-col width, inner boxes at cols 4-15, 18-29, 32-43. Also
restructured the bottom-half message flow so it's bare text
(not half-open box cells) matching the intent of the original.
- agent-loop.md lines 112-114: box 2 (API thread) content lines had
one extra space pushing the right border to col 46 while the top
and bottom borders of that box sat at col 45. Trimmed one trailing
space from each of the three content lines.
All 123 docs files now pass `npm run lint:diagrams`:
✓ Errors: 0 (warnings: 6, non-fatal)
Pre-existing failures on main — unrelated to any open PR.
* test(setup): accept description kwarg in prompt_choice mock lambdas
setup.py's `_curses_prompt_choice` gained an optional `description`
parameter (used for rendering context hints alongside the prompt).
`prompt_choice` forwards it via keyword arg. The two existing tests
mocked `_curses_prompt_choice` with lambdas that didn't accept the
new kwarg, so the forwarded call raised TypeError.
Fix: add `description=None` to both mock lambda signatures so they
absorb the new kwarg without changing behavior.
* test(matrix): update stale audio-caching assertion
test_regular_audio_has_http_url asserted that non-voice audio
messages keep their HTTP URL and are NOT downloaded/cached. That
was true when the caching code only triggered on
`is_voice_message`. Since bec02f37 (encrypted-media caching
refactor), matrix.py caches all media locally — photos, audio,
video, documents — so downstream tools can read them as real
files via media_urls. This applies to regular audio too.
Renamed the test to `test_regular_audio_is_cached_locally`,
flipped the assertions accordingly, and documented the
intentional behavior change in the docstring. Other tests in
the file (voice-specific caching, message-type detection,
reply-to threading) continue to pass.
* test(413): allow multi-pass preflight compression
run_agent.py's preflight compression runs up to 3 passes in a loop
for very large sessions (each pass summarizes the middle N turns,
then re-checks tokens). The loop breaks when a pass returns a
message list no shorter than its input (can't compress further).
test_preflight_compresses_oversized_history used a static mock
return value that returned the same 2 messages regardless of input,
so the loop ran pass 1 (41 -> 2) and pass 2 (2 -> 2 -> break),
making call_count == 2. The assert_called_once() assertion was
strictly wrong under the multi-pass design.
The invariant the test actually cares about is: preflight ran, and
its first invocation received the full oversized history. Replaced
the count assertion with those two invariants.
* docs: drop '...' from gateway diagram, merge side-by-side boxes
ascii-guard 2.3.0 flagged two remaining issues after the initial fix
pass:
1. gateway-internals.md L33: the '...' suffix after inner box 3's
right border got parsed as 'extra characters after inner-box right
border'. Dropped the '...' — the surrounding prose already conveys
'and more platforms' without needing the visual hint.
2. agent-loop.md: ascii-guard can't cleanly parse two side-by-side
boxes of different heights (main thread 7 rows, API thread 5 rows).
Even equalizing heights didn't help — the linter treats the left
box's right border as the end of the diagram. Merged into a single
54-char-wide outer box with both threads labeled as regions inside,
keeping the ▶ arrow to preserve the main→API flow direction.
---
sidebar_position: 3
title: Agent Loop Internals
description: Detailed walkthrough of AIAgent execution, API modes, tools, callbacks, and fallback behavior
---
# Agent Loop Internals

The core orchestration engine is `run_agent.py`'s `AIAgent` class — roughly 10,700 lines that handle everything from prompt assembly to tool dispatch to provider failover.

## Core Responsibilities

`AIAgent` is responsible for:

- Assembling the effective system prompt and tool schemas via `prompt_builder.py`
- Selecting the correct provider/API mode (`chat_completions`, `codex_responses`, `anthropic_messages`)
- Making interruptible model calls with cancellation support
- Executing tool calls (sequentially or concurrently via thread pool)
- Maintaining conversation history in OpenAI message format
- Handling compression, retries, and fallback model switching
- Tracking iteration budgets across parent and child agents
- Flushing persistent memory before context is lost
## Two Entry Points

```python
# Simple interface — returns final response string
response = agent.chat("Fix the bug in main.py")

# Full interface — returns dict with messages, metadata, usage stats
result = agent.run_conversation(
    user_message="Fix the bug in main.py",
    system_message=None,           # auto-built if omitted
    conversation_history=None,     # auto-loaded from session if omitted
    task_id="task_abc123",
)
```

`chat()` is a thin wrapper around `run_conversation()` that extracts the `final_response` field from the result dict.
## API Modes

Hermes supports three API execution modes, resolved from provider selection, explicit args, and base URL heuristics:

| API mode | Used for | Client type |
|---|---|---|
| `chat_completions` | OpenAI-compatible endpoints (OpenRouter, custom, most providers) | `openai.OpenAI` |
| `codex_responses` | OpenAI Codex / Responses API | `openai.OpenAI` with Responses format |
| `anthropic_messages` | Native Anthropic Messages API | `anthropic.Anthropic` via adapter |
The mode determines how messages are formatted, how tool calls are structured, how responses are parsed, and how caching/streaming works. All three converge on the same internal message format (OpenAI-style role/content/tool_calls dicts) before and after API calls.
Mode resolution order:

1. Explicit `api_mode` constructor arg (highest priority)
2. Provider-specific detection (e.g., `anthropic` provider → `anthropic_messages`)
3. Base URL heuristics (e.g., `api.anthropic.com` → `anthropic_messages`)
4. Default: `chat_completions`
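The resolution order above can be sketched as a small helper. This is a hedged illustration — the function name and exact checks are assumptions; the real logic lives inside `run_agent.py`:

```python
def resolve_api_mode(api_mode=None, provider=None, base_url=""):
    """Pick an API mode following the priority order above (sketch)."""
    if api_mode:                          # 1. explicit constructor arg wins
        return api_mode
    if provider == "anthropic":           # 2. provider-specific detection
        return "anthropic_messages"
    if "api.anthropic.com" in base_url:   # 3. base URL heuristic
        return "anthropic_messages"
    return "chat_completions"             # 4. default
```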
## Turn Lifecycle

Each iteration of the agent loop follows this sequence:

```
run_conversation()
  1. Generate task_id if not provided
  2. Append user message to conversation history
  3. Build or reuse cached system prompt (prompt_builder.py)
  4. Check if preflight compression is needed (>50% context)
  5. Build API messages from conversation history
     - chat_completions: OpenAI format as-is
     - codex_responses: convert to Responses API input items
     - anthropic_messages: convert via anthropic_adapter.py
  6. Inject ephemeral prompt layers (budget warnings, context pressure)
  7. Apply prompt caching markers if on Anthropic
  8. Make interruptible API call (_api_call_with_interrupt)
  9. Parse response:
     - If tool_calls: execute them, append results, loop back to step 5
     - If text response: persist session, flush memory if needed, return
```
## Message Format

All messages use OpenAI-compatible format internally:

```python
{"role": "system", "content": "..."}
{"role": "user", "content": "..."}
{"role": "assistant", "content": "...", "tool_calls": [...]}
{"role": "tool", "tool_call_id": "...", "content": "..."}
```

Reasoning content (from models that support extended thinking) is stored in `assistant_msg["reasoning"]` and optionally displayed via the `reasoning_callback`.
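A complete tool round-trip in this format looks like the following. The tool name and values are illustrative, not taken from the real codebase:

```python
# A minimal tool round-trip in the internal OpenAI-style format
# (tool name and argument values are illustrative):
history = [
    {"role": "system", "content": "You are Hermes."},
    {"role": "user", "content": "What is in config.json?"},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "read_file", "arguments": '{"path": "config.json"}'}},
    ]},
    # the tool result echoes the id of the call it answers
    {"role": "tool", "tool_call_id": "call_1", "content": '{"debug": true}'},
    {"role": "assistant", "content": "config.json enables debug mode."},
]
```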
## Message Alternation Rules

The agent loop enforces strict message role alternation:

- After the system message: User → Assistant → User → Assistant → ...
- During tool calling: Assistant (with `tool_calls`) → Tool → Tool → ... → Assistant
- Never two assistant messages in a row
- Never two user messages in a row
- Only the `tool` role can have consecutive entries (parallel tool results)
Providers validate these sequences and will reject malformed histories.
## Interruptible API Calls

API requests are wrapped in `_api_call_with_interrupt()`, which runs the actual HTTP call in a background thread while monitoring an interrupt event:

```
┌────────────────────────────────────────────────────┐
│  Main thread                   API thread          │
│                                                    │
│  wait on:                      HTTP POST           │
│   - response ready   ───▶      to provider         │
│   - interrupt event                                │
│   - timeout                                        │
└────────────────────────────────────────────────────┘
```
When interrupted (user sends new message, /stop command, or signal):
- The API thread is abandoned (response discarded)
- The agent can process the new input or shut down cleanly
- No partial response is injected into conversation history
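The pattern can be sketched with standard-library threading. This is a minimal illustration of the mechanism; the real `_api_call_with_interrupt()` may differ in signature, polling interval, and timeout handling:

```python
import threading

def api_call_with_interrupt(make_request, interrupt_event, timeout=300.0):
    """Run make_request() on a worker thread; poll for completion or interrupt (sketch)."""
    result = {}
    done = threading.Event()

    def worker():
        result["response"] = make_request()
        done.set()

    threading.Thread(target=worker, daemon=True).start()

    waited = 0.0
    while waited < timeout:
        if done.wait(0.05):
            return result["response"]   # response ready
        if interrupt_event.is_set():
            return None                 # abandon the thread, discard response
        waited += 0.05
    return None                         # timeout
```

Because the worker thread is a daemon and its result dict is simply dropped, no partial response can leak into conversation history on interrupt.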
## Tool Execution

### Sequential vs Concurrent
When the model returns tool calls:
- Single tool call → executed directly in the main thread
- Multiple tool calls → executed concurrently via `ThreadPoolExecutor`
  - Exception: tools marked as interactive (e.g., `clarify`) force sequential execution
- Results are reinserted in the original tool call order regardless of completion order
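The dispatch policy can be sketched as follows. Function and field names here are hypothetical; note that `ThreadPoolExecutor.map` already yields results in submission order, which gives the ordering guarantee for free:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool_calls(tool_calls, run_one, interactive_tools=("clarify",)):
    """Execute tool calls per the policy above, preserving result order (sketch)."""
    sequential = (
        len(tool_calls) == 1
        or any(c["name"] in interactive_tools for c in tool_calls)
    )
    if sequential:
        return [run_one(c) for c in tool_calls]
    # map() yields results in submission order, regardless of which
    # call finishes first — no manual reordering needed.
    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        return list(pool.map(run_one, tool_calls))
```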
### Execution Flow

```
for each tool_call in response.tool_calls:
  1. Resolve handler from tools/registry.py
  2. Fire pre_tool_call plugin hook
  3. Check if dangerous command (tools/approval.py)
     - If dangerous: invoke approval_callback, wait for user
  4. Execute handler with args + task_id
  5. Fire post_tool_call plugin hook
  6. Append {"role": "tool", "content": result} to history
```
### Agent-Level Tools

Some tools are intercepted by `run_agent.py` before reaching `handle_function_call()`:
| Tool | Why intercepted |
|---|---|
| `todo` | Reads/writes agent-local task state |
| `memory` | Writes to persistent memory files with character limits |
| `session_search` | Queries session history via the agent's session DB |
| `delegate_task` | Spawns subagent(s) with isolated context |
These tools modify agent state directly and return synthetic tool results without going through the registry.
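The interception pattern amounts to a guard in front of the registry dispatch — a sketch with hypothetical names (the real routing is inside `run_agent.py`):

```python
# Tools handled inside the agent rather than by the registry (per the table above)
AGENT_LEVEL_TOOLS = {"todo", "memory", "session_search", "delegate_task"}

def dispatch_tool(name, args, agent_handlers, registry_dispatch):
    """Route agent-state tools to in-agent handlers, everything else to the registry."""
    if name in AGENT_LEVEL_TOOLS:
        # Mutates agent state directly and returns a synthetic tool result.
        return agent_handlers[name](args)
    return registry_dispatch(name, args)
```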
## Callback Surfaces

`AIAgent` supports platform-specific callbacks that enable real-time progress in the CLI, gateway, and ACP integrations:
| Callback | When fired | Used by |
|---|---|---|
| `tool_progress_callback` | Before/after each tool execution | CLI spinner, gateway progress messages |
| `thinking_callback` | When model starts/stops thinking | CLI "thinking..." indicator |
| `reasoning_callback` | When model returns reasoning content | CLI reasoning display, gateway reasoning blocks |
| `clarify_callback` | When `clarify` tool is called | CLI input prompt, gateway interactive message |
| `step_callback` | After each complete agent turn | Gateway step tracking, ACP progress |
| `stream_delta_callback` | Each streaming token (when enabled) | CLI streaming display |
| `tool_gen_callback` | When tool call is parsed from stream | CLI tool preview in spinner |
| `status_callback` | State changes (thinking, executing, etc.) | ACP status updates |
## Budget and Fallback Behavior

### Iteration Budget

The agent tracks iterations via `IterationBudget`:

- Default: 90 iterations (configurable via `agent.max_turns`)
- Each agent gets its own budget. Subagents get independent budgets capped at `delegation.max_iterations` (default 50) — total iterations across parent + subagents can exceed the parent's cap
- At 100%, the agent stops and returns a summary of work done
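The budget accounting can be sketched like this. The class name matches the source; the method and attribute names are assumptions for illustration:

```python
class IterationBudget:
    """Per-agent iteration accounting (hypothetical API sketch)."""

    def __init__(self, max_turns=90):
        self.max_turns = max_turns
        self.used = 0

    def consume(self):
        self.used += 1

    @property
    def exhausted(self):
        return self.used >= self.max_turns

# Parent and subagent budgets are independent objects:
parent = IterationBudget(max_turns=90)
child = IterationBudget(max_turns=50)   # capped by delegation.max_iterations
```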
### Fallback Model

When the primary model fails (429 rate limit, 5xx server error, 401/403 auth error):

- Check `fallback_providers` list in config
- Try each fallback in order
- On success, continue the conversation with the new provider
- On 401/403, attempt credential refresh before failing over
The fallback system also covers auxiliary tasks independently — vision, compression, web extraction, and session search each have their own fallback chain, configurable via the `auxiliary.*` config section.
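The failover loop can be sketched as follows. Exception class names and the `refresh_credentials` hook are assumptions standing in for the real error taxonomy:

```python
class RateLimitError(Exception): pass   # stand-in for 429
class ServerError(Exception): pass      # stand-in for 5xx
class AuthError(Exception): pass        # stand-in for 401/403

def call_with_fallback(providers, call, refresh_credentials=lambda p: False):
    """Try providers in order; on auth errors, refresh credentials first (sketch)."""
    last_err = None
    for provider in providers:
        try:
            return provider, call(provider)        # success: stay on this provider
        except AuthError as err:
            last_err = err
            if refresh_credentials(provider):       # 401/403: refresh, then retry once
                try:
                    return provider, call(provider)
                except Exception as err2:
                    last_err = err2
        except (RateLimitError, ServerError) as err:
            last_err = err                          # 429/5xx: try next provider
    raise last_err
```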
## Compression and Persistence

### When Compression Triggers
- Preflight (before API call): If conversation exceeds 50% of model's context window
- Gateway auto-compression: If conversation exceeds 85% (more aggressive, runs between turns)
### What Happens During Compression
- Memory is flushed to disk first (preventing data loss)
- Middle conversation turns are summarized into a compact summary
- The last N messages are preserved intact (`compression.protect_last_n`, default: 20)
- Tool call/result message pairs are kept together (never split)
- A new session lineage ID is generated (compression creates a "child" session)
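The "protect the tail, never split tool pairs" rule can be sketched as a cut-point adjustment — an illustrative helper, not the actual compressor code:

```python
def split_for_compression(messages, protect_last_n=20):
    """Split history into (to_summarize, protected_tail) — sketch.

    The cut point moves earlier if it would separate a tool result from
    the assistant message that issued the tool call.
    """
    cut = max(len(messages) - protect_last_n, 0)
    # Never start the protected tail on a tool result: back up until we
    # reach the assistant message carrying the tool_calls.
    while cut > 0 and messages[cut]["role"] == "tool":
        cut -= 1
    return messages[:cut], messages[cut:]
```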
### Session Persistence

After each turn:

- Messages are saved to the session store (SQLite via `hermes_state.py`)
- Memory changes are flushed to `MEMORY.md` / `USER.md`
- The session can be resumed later via `/resume` or `hermes chat --resume`
## Key Source Files

| File | Purpose |
|---|---|
| `run_agent.py` | `AIAgent` class — the complete agent loop (~10,700 lines) |
| `agent/prompt_builder.py` | System prompt assembly from memory, skills, context files, personality |
| `agent/context_engine.py` | `ContextEngine` ABC — pluggable context management |
| `agent/context_compressor.py` | Default engine — lossy summarization algorithm |
| `agent/prompt_caching.py` | Anthropic prompt caching markers and cache metrics |
| `agent/auxiliary_client.py` | Auxiliary LLM client for side tasks (vision, summarization) |
| `model_tools.py` | Tool schema collection, `handle_function_call()` dispatch |