Mirror of https://github.com/NousResearch/hermes-agent.git, synced 2026-04-27 01:11:40 +00:00
fix: move pre_llm_call plugin context to user message, preserve prompt cache (#5146)
Plugin context from pre_llm_call hooks was injected into the system prompt, breaking the prompt cache prefix every turn when content changed (typical for memory plugins). Now all plugin context goes into the current turn's user message — the system prompt stays identical across turns, preserving cached tokens. The system prompt is reserved for Hermes internals; plugins contribute context alongside the user's input.

Also adds comprehensive documentation for all 6 plugin hooks: pre_tool_call, post_tool_call, pre_llm_call, post_llm_call, on_session_start, on_session_end — each with full callback signatures, parameter tables, firing conditions, and examples.

Supersedes #5138, which identified the same cache-busting bug and proposed an uncached system suffix approach. This fix goes further by removing system prompt injection entirely.

Co-identified-by: OutThisLife (PR #5138)
This commit is contained in: parent 96e96a79ad, commit 5879b3ef82.
6 changed files with 653 additions and 57 deletions.
def register(ctx):
    ctx.register_hook("pre_tool_call", before_any_tool)
    ctx.register_hook("post_tool_call", after_any_tool)
    ctx.register_hook("pre_llm_call", inject_memory)
    ctx.register_hook("on_session_start", on_new_session)
    ctx.register_hook("on_session_end", on_session_end)
```
### Hook reference
Each hook is documented in full on the **[Event Hooks reference](/docs/user-guide/features/hooks#plugin-hooks)** — callback signatures, parameter tables, exactly when each fires, and examples. Here's the summary:
| Hook | Fires when | Callback signature | Returns |
|------|-----------|-------------------|---------|
| [`pre_tool_call`](/docs/user-guide/features/hooks#pre_tool_call) | Before any tool executes | `tool_name: str, args: dict, task_id: str` | ignored |
| [`post_tool_call`](/docs/user-guide/features/hooks#post_tool_call) | After any tool returns | `tool_name: str, args: dict, result: str, task_id: str` | ignored |
| [`pre_llm_call`](/docs/user-guide/features/hooks#pre_llm_call) | Once per turn, before the tool-calling loop | `session_id: str, user_message: str, conversation_history: list, is_first_turn: bool, model: str, platform: str` | [context injection](#pre_llm_call-context-injection) |
| [`post_llm_call`](/docs/user-guide/features/hooks#post_llm_call) | Once per turn, after the tool-calling loop (successful turns only) | `session_id: str, user_message: str, assistant_response: str, conversation_history: list, model: str, platform: str` | ignored |
| [`on_session_start`](/docs/user-guide/features/hooks#on_session_start) | New session created (first turn only) | `session_id: str, model: str, platform: str` | ignored |
| [`on_session_end`](/docs/user-guide/features/hooks#on_session_end) | End of every `run_conversation` call + CLI exit | `session_id: str, completed: bool, interrupted: bool, model: str, platform: str` | ignored |
Most hooks are fire-and-forget observers — their return values are ignored. The exception is `pre_llm_call`, which can inject context into the conversation.
All callbacks should accept `**kwargs` for forward compatibility. If a hook callback crashes, it's logged and skipped. Other hooks and the agent continue normally.
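
That isolation behaviour can be sketched as a dispatcher loop. The helper below is a hypothetical illustration, not the actual Hermes internals:

```python
import logging

logger = logging.getLogger(__name__)


def fire_hooks(callbacks, **payload):
    """Illustrative dispatcher: run each callback in isolation, log and
    skip any that raise, and collect non-None return values."""
    results = []
    for cb in callbacks:
        try:
            out = cb(**payload)
        except Exception:
            logger.exception("hook %r failed; skipping",
                             getattr(cb, "__name__", cb))
            continue
        if out is not None:
            results.append(out)
    return results
```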
### `pre_llm_call` context injection
This is the only hook whose return value matters. When a `pre_llm_call` callback returns a dict with a `"context"` key (or a plain string), Hermes injects that text into the **current turn's user message**. This is the mechanism for memory plugins, RAG integrations, guardrails, and any plugin that needs to provide the model with additional context.
#### Return format
```python
# Dict with context key
return {"context": "Recalled memories:\n- User prefers dark mode\n- Last project: hermes-agent"}

# Plain string (equivalent to the dict form above)
return "Recalled memories:\n- User prefers dark mode"

# Return None or don't return → no injection (observer-only)
return None
```
Any non-None, non-empty return with a `"context"` key (or a plain non-empty string) is collected and appended to the user message for the current turn.
#### How injection works
Injected context is appended to the **user message**, not the system prompt. This is a deliberate design choice:
- **Prompt cache preservation** — the system prompt stays identical across turns. Anthropic and OpenRouter cache the system prompt prefix, so keeping it stable saves 75%+ on input tokens in multi-turn conversations. If plugins modified the system prompt, every turn would be a cache miss.
- **Ephemeral** — the injection happens at API call time only. The original user message in the conversation history is never mutated, and nothing is persisted to the session database.
- **The system prompt is Hermes's territory** — it contains model-specific guidance, tool enforcement rules, personality instructions, and cached skill content. Plugins contribute context alongside the user's input, not by altering the agent's core instructions.
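
Conceptually, the call-time merge looks something like this. The function name, message shapes, and delimiter are illustrative assumptions, not the actual Hermes code:

```python
def build_outgoing_messages(history, user_message, plugin_context):
    """Illustrative: append plugin context to the outgoing user message
    at API call time, without mutating the stored history."""
    content = user_message
    if plugin_context:
        # the exact delimiter is an assumption for illustration
        content = f"{user_message}\n\n[plugin context]\n{plugin_context}"
    # copy, don't mutate: the session database keeps the original turn
    return list(history) + [{"role": "user", "content": content}]
```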
#### Example: Memory recall plugin
```python
"""Memory plugin — recalls relevant context from a vector store."""

import httpx

MEMORY_API = "https://your-memory-api.example.com"


def recall_context(session_id, user_message, is_first_turn, **kwargs):
    """Called before each LLM turn. Returns recalled memories."""
    try:
        resp = httpx.post(f"{MEMORY_API}/recall", json={
            "session_id": session_id,
            "query": user_message,
        }, timeout=3)
        memories = resp.json().get("results", [])
        if not memories:
            return None  # nothing to inject

        text = "Recalled context from previous sessions:\n"
        text += "\n".join(f"- {m['text']}" for m in memories)
        return {"context": text}
    except Exception:
        return None  # fail silently, don't break the agent


def register(ctx):
    ctx.register_hook("pre_llm_call", recall_context)
```
#### Example: Guardrails plugin
```python
"""Guardrails plugin — enforces content policies."""

POLICY = """You MUST follow these content policies for this session:
- Never generate code that accesses the filesystem outside the working directory
- Always warn before executing destructive operations
- Refuse requests involving personal data extraction"""


def inject_guardrails(**kwargs):
    """Injects policy text into every turn."""
    return {"context": POLICY}


def register(ctx):
    ctx.register_hook("pre_llm_call", inject_guardrails)
```
#### Example: Observer-only hook (no injection)
```python
"""Analytics plugin — tracks turn metadata without injecting context."""

import logging

logger = logging.getLogger(__name__)


def log_turn(session_id, user_message, model, is_first_turn, **kwargs):
    """Fires before each LLM call. Returns None — no context injected."""
    logger.info("Turn: session=%s model=%s first=%s msg_len=%d",
                session_id, model, is_first_turn, len(user_message or ""))
    # No return → no injection


def register(ctx):
    ctx.register_hook("pre_llm_call", log_turn)
```
#### Multiple plugins returning context
When multiple plugins return context from `pre_llm_call`, their outputs are joined with double newlines and appended to the user message together. The order follows plugin discovery order (alphabetical by plugin directory name).
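
The joining rule can be sketched as a small helper. This is a hypothetical normalization pass matching the behaviour described above, not the actual Hermes implementation:

```python
def collect_plugin_context(hook_results):
    """Illustrative: normalize pre_llm_call return values (dicts with a
    "context" key, or plain strings) and join them with double newlines.
    hook_results is assumed to arrive in plugin discovery order."""
    chunks = []
    for result in hook_results:
        if isinstance(result, dict) and result.get("context"):
            chunks.append(result["context"])
        elif isinstance(result, str) and result.strip():
            chunks.append(result)
    return "\n\n".join(chunks)
```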
### Distribute via pip