hermes-agent/docs/session-lifecycle.md
2026-06-20 23:23:47 -07:00

Session Lifecycle

Audience: Gateway developers and maintainers
Source files: gateway/session.py (~1444 lines), gateway/run.py (~16800 lines), gateway/config.py
Last updated: 2026-06-16

Overview

A session represents a continuous conversation between the agent and one or more users on a messaging platform. The session lifecycle governs when conversations persist, when they reset, how they survive gateway restarts, and how messages queue during concurrent operations.

The session system lives primarily in two modules:

  • gateway/session.py — Data model (SessionSource, SessionEntry, SessionContext), key generation (build_session_key), and the main store (SessionStore).
  • gateway/run.py — Gateway runner (GatewayRunner) that wires sessions into the message processing pipeline: session expiry watching, agent caching, restart recovery, and message queuing.

1. SessionSource — Message Origin Descriptor

SessionSource is a frozen record of where a message came from. It is attached to every incoming MessageEvent and used for routing, isolation, and context injection.

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| platform | Platform | (required) | Enum identifying the messaging platform (telegram, discord, slack, signal, whatsapp, matrix, local, etc.). |
| chat_id | str | (required) | Platform-level chat/group/channel identifier. Routed through the adapter's chat_id_key transform. |
| chat_name | Optional[str] | None | Human-readable name of the chat or group. |
| chat_type | str | "dm" | One of "dm", "group", "channel", "thread". Controls session key generation and isolation. |
| user_id | Optional[str] | None | Platform-specific user identifier. Used for authorization and per-user session isolation. |
| user_name | Optional[str] | None | Display name of the message author. Injected into system prompt. |
| thread_id | Optional[str] | None | Forum topic / Discord thread / Slack thread identifier. Differentiates threaded conversations. |
| chat_topic | Optional[str] | None | Channel topic or description (Discord channel topic, Slack channel purpose). |
| user_id_alt | Optional[str] | None | Platform-specific stable alternative ID (Signal UUID, Feishu union_id). Used when user_id is ephemeral. |
| chat_id_alt | Optional[str] | None | Signal group internal ID — maps a Signal group V2 identifier to its canonical form. |
| is_bot | bool | False | True when the message author is a bot or webhook (Discord bots). |
| guild_id | Optional[str] | None | Discord guild / Slack workspace / Matrix server scope identifier. |
| parent_chat_id | Optional[str] | None | Parent channel when chat_id refers to a thread. |
| message_id | Optional[str] | None | ID of the triggering message. Used for pin/reply/react operations and Discord ID injection. |
| role_authorized | bool | False | True when adapter granted access via a platform role (not individual user ID). |

Key Methods

  • description (property: str) — Human-readable summary, e.g. "DM with Alice" or "group: My Group, thread: 12345".
  • to_dict() / from_dict() — Serialization round-trip for persistence in sessions.json.
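
The serialization round trip can be illustrated with a minimal sketch. It assumes SessionSource is a frozen dataclass and that Platform is a string-valued enum; the class name SessionSourceSketch, the two enum members, and the reduced field set are illustrative only, and the real to_dict() / from_dict() may handle more cases.

```python
from dataclasses import dataclass, asdict, fields
from enum import Enum
from typing import Any, Dict, Optional


class Platform(Enum):
    # Hypothetical subset of the real Platform enum.
    TELEGRAM = "telegram"
    DISCORD = "discord"


@dataclass(frozen=True)
class SessionSourceSketch:
    # Only a handful of the documented fields, for brevity.
    platform: Platform
    chat_id: str
    chat_type: str = "dm"
    user_id: Optional[str] = None
    thread_id: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        data = asdict(self)
        data["platform"] = self.platform.value  # enums are not JSON-serializable
        return data

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SessionSourceSketch":
        # Ignore unknown keys so older or newer files still load.
        known = {f.name for f in fields(cls)}
        kwargs = {k: v for k, v in data.items() if k in known}
        kwargs["platform"] = Platform(kwargs["platform"])
        return cls(**kwargs)
```

Dropping unknown keys in from_dict() is a design choice of this sketch, not a documented behavior; it keeps a sessions.json written by a different version loadable.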

2. SessionEntry — Active Session Record

SessionEntry is the per-session metadata record stored in memory and persisted to {sessions_dir}/sessions.json. Each entry maps a session_key to its current session_id.

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| session_key | str | (required) | Deterministic key identifying the conversation lane (see §4). |
| session_id | str | (required) | Unique identifier for this specific conversation incarnation. Format: YYYYMMDD_HHMMSS_<8hex>. |
| created_at | datetime | (required) | When this session incarnation was created. |
| updated_at | datetime | (required) | Last activity timestamp. Used for idle timeout and expiry checks. |
| origin | Optional[SessionSource] | None | The source that created this session, used for delivery routing. |
| display_name | Optional[str] | None | Chat display name (sourced from SessionSource.chat_name). |
| platform | Optional[Platform] | None | Platform enum, persisted for expiry policy lookup across restarts. |
| chat_type | str | "dm" | Chat type, also persisted for policy lookup. |
| input_tokens | int | 0 | Cumulative LLM input (prompt) tokens consumed. |
| output_tokens | int | 0 | Cumulative LLM output (completion) tokens consumed. |
| cache_read_tokens | int | 0 | Cumulative prompt cache read tokens. |
| cache_write_tokens | int | 0 | Cumulative prompt cache write tokens. |
| total_tokens | int | 0 | Total token count across all turns. |
| estimated_cost_usd | float | 0.0 | Estimated cumulative USD cost. |
| cost_status | str | "unknown" | Cost tracking status label. |
| last_prompt_tokens | int | 0 | Last API-reported prompt token count. Used for accurate compression pre-check. |

Boolean Flags (State Machine)

SessionEntry has several boolean flags that form a simple state machine governing session behavior on the next access.

| Flag | Type | Default | Description |
|---|---|---|---|
| was_auto_reset | bool | False | Set when a session was auto-reset due to policy expiry (idle/daily). Consumed once to inject a context notice. |
| auto_reset_reason | Optional[str] | None | "idle" or "daily" — why the previous session was auto-reset. |
| reset_had_activity | bool | False | Whether the expired session had any messages (total_tokens > 0). |
| is_fresh_reset | bool | False | Set by explicit /new or /reset. Triggers topic/channel skill re-injection on first message. Distinguished from was_auto_reset to avoid misleading "session expired" notices. |
| expiry_finalized | bool | False | Set by background expiry watcher after invoking on_session_finalize hooks, cleaning tool resources, and evicting the cached agent. Prevents redundant finalization across restarts. |
| suspended | bool | False | Hard force-wipe signal. Set by /stop or stuck-loop escalation (3+ consecutive restart failures). On next get_or_create_session(), forces a new session_id regardless of resume_pending. |
| resume_pending | bool | False | Soft recovery marker. Set by suspend_recently_active() (crash recovery) or drain timeout. On next access, preserves the existing session_id — the user continues on the same transcript. Cleared after the next successful turn completes. |
| resume_reason | Optional[str] | None | Why resume was marked: "restart_timeout", "shutdown_timeout", "restart_interrupted". |
| last_resume_marked_at | Optional[datetime] | None | Timestamp of the last resume-pending marking. |

State Transition Logic (get_or_create_session)

                    ┌──────────┐
                    │  Incoming │
                    │  Message  │
                    └────┬─────┘
                         │
                         ▼
              ┌──────────────────────┐
              │  session_key exists  │──── No ──► Create fresh SessionEntry
              │  AND !force_new      │
              └──────────┬───────────┘
                         │ Yes
                         ▼
              ┌──────────────────────┐
              │  entry.suspended?    │──── Yes ──► Auto-reset: new session_id
              └──────────┬───────────┘           (reason="suspended")
                         │ No
                         ▼
              ┌──────────────────────┐
              │ entry.resume_pending?│──── Yes ──► Return existing entry
              └──────────┬───────────┘           (preserve session_id)
                         │ No                     Clear flag on next successful turn
                         ▼
              ┌──────────────────────┐
              │   Policy says reset? │──── Yes ──► Auto-reset: new session_id
              └──────────┬───────────┘           (reason="idle"/"daily")
                         │ No
                         ▼
              ┌──────────────────────┐
              │  Return existing     │
              │  entry, bump         │
              │  updated_at          │
              └──────────────────────┘

Priority order in get_or_create_session():

  1. suspended=True → always force-reset (hard wipe)
  2. resume_pending=True → preserve session_id (soft recovery)
  3. Policy expiry (idle/daily) → auto-reset
  4. No trigger → return existing entry (bump updated_at)
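
The priority order above can be sketched as a pure function. EntryFlags and decide are hypothetical names, and the action labels in the return value are invented for illustration; the real get_or_create_session() also creates and ends SQLite records, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class EntryFlags:
    # Hypothetical stand-in for the SessionEntry flags consulted here.
    suspended: bool = False
    resume_pending: bool = False


def decide(entry: Optional[EntryFlags], policy_reason: Optional[str],
           force_new: bool = False) -> Tuple[str, Optional[str]]:
    """Return (action, reason) following the documented priority order."""
    if entry is None or force_new:
        return ("create", None)
    if entry.suspended:                      # 1. hard wipe always wins
        return ("reset", "suspended")
    if entry.resume_pending:                 # 2. soft recovery keeps session_id
        return ("resume", None)
    if policy_reason in ("idle", "daily"):   # 3. policy expiry
        return ("reset", policy_reason)
    return ("reuse", None)                   # 4. bump updated_at and return
```

Note that a resume_pending entry is returned even when the policy would otherwise expire it, which is what lets an interrupted turn continue on the same transcript.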

3. SessionStore — Storage and Operations

SessionStore is the main storage layer. It maintains an in-memory dict (_entries) persisted to sessions.json, with SQLite (SessionDB) as the canonical store for session metadata and message transcripts.

Constructor

SessionStore(sessions_dir: Path, config: GatewayConfig, has_active_processes_fn=None)
  • sessions_dir — Directory where sessions.json lives.
  • config — GatewayConfig instance for reset policy lookups.
  • has_active_processes_fn — Optional callback keyed by session_key to check for running background processes. Sessions with active processes are never expired or pruned.

Operations (Methods)

| Method | Description |
|---|---|
| get_or_create_session(source, force_new=False) | Core entry point. Returns existing or creates new SessionEntry. Evaluates suspended, resume_pending, and reset policy. Creates/ends SQLite records. |
| update_session(session_key, last_prompt_tokens=None) | Lightweight metadata update after an interaction. Bumps updated_at, optionally records last_prompt_tokens. |
| reset_session(session_key, display_name=None) | Explicit reset (from /new or /reset). Creates new session_id, sets is_fresh_reset=True. Ends old SQLite session, creates new one. |
| switch_session(session_key, target_session_id) | Switch to a different existing session ID (from /resume). Ends current SQLite session, reopens target. |
| suspend_session(session_key) | Mark session as suspended=True (from /stop). Forces auto-reset on next access. |
| mark_resume_pending(session_key, reason) | Mark session as resume_pending=True (from drain timeout). Preserves session_id on next access. Will NOT override suspended=True. |
| clear_resume_pending(session_key) | Clear resume_pending after a successful resumed turn. Called from gateway after run_conversation() returns. |
| suspend_recently_active(max_age_seconds=120) | Crash recovery: mark recently-active sessions as resume_pending=True. Skips already-pending and already-suspended entries. Called on startup after unclean shutdown. |
| prune_old_entries(max_age_days) | Drop entries older than max_age_days (based on updated_at). Skips suspended entries and sessions with active processes. |
| list_sessions(active_minutes=None) | Return all sessions, optionally filtered by recent activity. Sorted by updated_at descending. |
| lookup_by_session_id(session_id) | Find the active SessionEntry for a persisted session ID. |
| has_any_sessions() | Check if any sessions have ever been created (uses SQLite for history, not just in-memory dict). |
| append_to_transcript(session_id, message, skip_db=False) | Append a message to SQLite transcript. skip_db=True prevents duplicate writes when the agent already persisted. |
| rewrite_transcript(session_id, messages) | Full replacement of session transcript (used by /retry, /undo, /compress). |
| load_transcript(session_id) | Load all messages from a session's SQLite transcript. |
| rewind_session(session_id, n=1) | Back up n user turns via soft-delete (keeps audit trail). Returns {rewound_count, turns_undone, target_text}. |

Internal Helpers

  • _ensure_loaded() / _ensure_loaded_locked() — Load sessions.json into _entries dict.
  • _save() — Atomic write to sessions.json via temp file + atomic_replace.
  • _generate_session_key(source) — Delegates to build_session_key() with config params.
  • _is_session_expired(entry) — Policy check from entry alone (no source needed). Used by background expiry watcher.
  • _should_reset(entry, source) — Policy check returning "idle", "daily", or None.
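
The temp-file-then-replace pattern that _save() is described as using can be sketched as follows. This is a generic illustration rather than the project's code: save_entries_atomically is a hypothetical name, and the standard library's os.replace is used here as an assumed equivalent of the atomic_replace helper.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def save_entries_atomically(path: Path, entries: Dict[str, Any]) -> None:
    """Write JSON to a temp file in the same directory, then swap it in."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(entries, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes are on disk before renaming
        os.replace(tmp_name, path)     # atomic when source and target share a filesystem
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file next to the target matters: a rename across filesystems is not atomic, so a reader could otherwise observe a partially written sessions.json.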

Storage Layout

{sessions_dir}/
  sessions.json          # In-memory _entries dict, persisted as JSON
                           Maps session_key → SessionEntry (metadata only)
  {session_id}.jsonl     # (Legacy, removed in spec 002)

The canonical transcript store is SQLite via SessionDB (from hermes_state). The sessions.json file persists the session_key → session_id mapping and entry metadata (flags, timestamps, token counts). If SQLite is unavailable, the store falls back to JSONL, but this is a degradation path.


4. SessionKey Generation Rules

Session keys are deterministic strings that identify a conversation lane. They are generated by build_session_key(source, group_sessions_per_user, thread_sessions_per_user).

Key Format

agent:main:{platform}:{chat_type}[:{chat_id}][:{thread_id}][:{participant_id}]

DM Rules

| Scenario | Key |
|---|---|
| DM with chat_id | agent:main:telegram:dm:12345 |
| DM with chat_id + thread | agent:main:telegram:dm:12345:thread_678 |
| DM without chat_id, with participant_id | agent:main:signal:dm:user_abc |
| DM without chat_id or participant_id | agent:main:telegram:dm |
| WhatsApp DM (canonicalized) | agent:main:whatsapp:dm:{canonical_number} |
  • DMs always include chat_id when present, isolating each private conversation.
  • thread_id further differentiates threaded DMs within the same DM chat.
  • Without chat_id, falls back to user_id_alt or user_id as participant_id.
  • Without any identifier, all DMs on that platform collapse to one shared session.

Group/Channel Rules

| Scenario | Key |
|---|---|
| Group chat | agent:main:telegram:group:-10012345 |
| Group chat, per-user isolation | agent:main:telegram:group:-10012345:user_abc |
| Thread in group, shared | agent:main:discord:group:12345:thread_678 |
| Thread in group, per-user | agent:main:discord:group:12345:thread_678:user_abc |
| Channel | agent:main:slack:channel:C12345 |
| WhatsApp group (canonicalized) | agent:main:whatsapp:group:{canonical_id}:{participant} |
  • chat_id identifies the parent group/channel.
  • thread_id differentiates threads within that parent.
  • Per-user isolation (append participant_id) is controlled by:
    • group_sessions_per_user (default: True) — group/channel sessions are isolated.
    • thread_sessions_per_user (default: False) — threads are shared by default (Telegram forum topics, Discord threads, Slack threads all share one session per thread).
  • participant_id = user_id_alt or user_id (in that priority).
  • WhatsApp identifiers are canonicalized to handle JID/LID alias flips.
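
The DM and group rules can be combined into one minimal sketch that reproduces the example keys in the tables above. Source and build_session_key_sketch are hypothetical stand-ins, platform is a plain string here, WhatsApp canonicalization is omitted, and ignoring thread_id for a DM without chat_id is an assumption the document does not settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    # Hypothetical minimal stand-in for SessionSource.
    platform: str
    chat_type: str = "dm"
    chat_id: Optional[str] = None
    thread_id: Optional[str] = None
    user_id: Optional[str] = None
    user_id_alt: Optional[str] = None


def build_session_key_sketch(source: Source, group_sessions_per_user: bool = True,
                             thread_sessions_per_user: bool = False) -> str:
    parts = ["agent", "main", source.platform, source.chat_type]
    participant = source.user_id_alt or source.user_id  # alt ID takes priority

    if source.chat_type == "dm":
        if source.chat_id:
            parts.append(source.chat_id)
            if source.thread_id:
                parts.append(source.thread_id)
        elif participant:
            parts.append(participant)  # fallback when chat_id is missing
        return ":".join(parts)

    # Group, channel, or thread: parent chat first, then thread, then participant.
    if source.chat_id:
        parts.append(source.chat_id)
    if source.thread_id:
        parts.append(source.thread_id)
    per_user = thread_sessions_per_user if source.thread_id else group_sessions_per_user
    if per_user and participant:
        parts.append(participant)
    return ":".join(parts)
```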

Special Case: WhatsApp

WhatsApp phone numbers go through canonical_whatsapp_identifier() which strips the @s.whatsapp.net suffix and normalizes to E.164 format. This prevents session fragmentation when the bridge returns different alias forms of the same phone number.
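
As a rough illustration of the suffix stripping and E.164 normalization described above, consider the sketch below. It is not the real canonical_whatsapp_identifier(): the function name carries a _sketch suffix, the handling of a device suffix such as ":12" is an assumption, and identifiers that are not phone-number JIDs (for example LID aliases, which would need a lookup) are returned unchanged.

```python
import re

_PHONE_SUFFIX = "@s.whatsapp.net"


def canonical_whatsapp_identifier_sketch(raw: str) -> str:
    value = raw.strip()
    if value.endswith(_PHONE_SUFFIX):
        value = value[: -len(_PHONE_SUFFIX)]
    if "@" in value:
        return raw  # not a phone-number JID; left untouched in this sketch
    value = value.split(":", 1)[0]  # assumed device suffix, e.g. "15551234567:12"
    digits = re.sub(r"\D", "", value)
    return f"+{digits}" if digits else raw
```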


5. Multi-User Isolation Strategy

Multi-user isolation determines whether multiple users in the same chat share a conversation or each get their own private session.

Decision Logic (is_shared_multi_user_session)

def is_shared_multi_user_session(source, *, group_sessions_per_user, thread_sessions_per_user):
    if source.chat_type == "dm":
        return False  # DMs are always private
    if source.thread_id:
        return not thread_sessions_per_user  # Threads: shared unless per-user
    return not group_sessions_per_user       # Groups: shared unless per-user

Summary

| Chat Type | Default | Config Control |
|---|---|---|
| DM | Private (never shared) | N/A |
| Group/Channel | Per-user isolation | group_sessions_per_user (default: True) |
| Thread (forum, Discord) | Shared (all participants see same context) | thread_sessions_per_user (default: False) |

Impact on System Prompt

When shared_multi_user_session=True, the system prompt omits a fixed user name and instead states: "Multi-user {thread|session} — messages are prefixed with [sender name]. Multiple users may participate." Individual sender names are prefixed on each user message by the gateway at runtime, preserving prompt caching (the system prompt doesn't change per-turn).
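
A minimal sketch of the per-message prefixing, assuming a "[name] text" layout and an "unknown" fallback for missing names. Both details and the function name prefix_sender are illustrative; the document only states that messages are prefixed with the sender name.

```python
from typing import Optional


def prefix_sender(text: str, user_name: Optional[str], shared: bool) -> str:
    """Prefix the message body with the sender name in shared sessions only."""
    if not shared:
        return text
    return f"[{user_name or 'unknown'}] {text}"
```

Because the sender travels with each user message, the system prompt stays byte-identical between turns, which is what keeps the provider's prompt cache valid.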


6. Reset Policy

Reset policies control when a session automatically loses context (gets a new session_id).

Policy Modes (SessionResetPolicy)

| Mode | Behavior | Default Config |
|---|---|---|
| "none" | Never auto-reset. Context managed only by compression. | |
| "idle" | Reset after N minutes of inactivity from updated_at. | idle_minutes: 1440 (24h) |
| "daily" | Reset at a specific hour each day (local time). | at_hour: 4 (4 AM) |
| "both" | Whichever triggers first — daily boundary OR idle timeout. | (default mode) |

Policy Evaluation

# Idle check
idle_deadline = entry.updated_at + timedelta(minutes=policy.idle_minutes)
if now > idle_deadline: return "idle"

# Daily check
today_reset = now.replace(hour=policy.at_hour, minute=0, second=0, microsecond=0)
if now.hour < policy.at_hour:
    today_reset -= timedelta(days=1)  # Reset hasn't happened yet today
if entry.updated_at < today_reset: return "daily"
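
The two checks can be wrapped into a self-contained function to make the boundary arithmetic testable. ResetPolicy and should_reset are hypothetical names, and the way the mode gates each check is an assumption inferred from the mode table, not taken from the source code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ResetPolicy:
    # Hypothetical stand-in for SessionResetPolicy with the documented defaults.
    mode: str = "both"
    at_hour: int = 4
    idle_minutes: int = 1440


def should_reset(updated_at: datetime, now: datetime, policy: ResetPolicy) -> Optional[str]:
    if policy.mode == "none":
        return None
    if policy.mode in ("idle", "both"):
        if now > updated_at + timedelta(minutes=policy.idle_minutes):
            return "idle"
    if policy.mode in ("daily", "both"):
        boundary = now.replace(hour=policy.at_hour, minute=0, second=0, microsecond=0)
        if now.hour < policy.at_hour:
            boundary -= timedelta(days=1)  # today's boundary is still in the future
        if updated_at < boundary:
            return "daily"
    return None
```

For example, with the defaults a session last touched at 03:00 and accessed at 09:00 the same day crosses the 04:00 boundary and resets with reason "daily", even though it has been idle for only six hours.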

Per-Platform/Per-Type Policies

Reset policies are configurable per platform and session type via config.get_reset_policy(). This allows different platforms to have different expiry rules (e.g., Telegram DMs reset after 24h idle, but Slack groups persist indefinitely).

Exclusions

Sessions with active background processes are never expired or reset. The has_active_processes_fn callback checks for running processes when evaluating policies.

Reset Effects

When a reset triggers:

  1. Old session is ended in SQLite (with reason "session_reset").
  2. New session_id is generated (YYYYMMDD_HHMMSS_<8hex>).
  3. New SessionEntry is created with was_auto_reset=True and the reset reason.
  4. reset_had_activity is set if the old session had any turns (total_tokens > 0).
  5. The old AIAgent cache entry is evicted on the next expiry watcher pass.
  6. On the first message after reset, a context notice is injected: "Session expired due to inactivity / daily reset."
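
Steps 2 to 4 can be sketched as below. The ID format follows the documented YYYYMMDD_HHMMSS_<8hex> pattern; using secrets.token_hex for the random suffix and returning a plain dict are assumptions of this sketch, and new_session_id and auto_reset_fields are hypothetical names.

```python
import secrets
from datetime import datetime
from typing import Any, Dict


def new_session_id(now: datetime) -> str:
    # token_hex(4) yields 4 random bytes, i.e. 8 hex characters.
    return f"{now:%Y%m%d_%H%M%S}_{secrets.token_hex(4)}"


def auto_reset_fields(old_total_tokens: int, reason: str, now: datetime) -> Dict[str, Any]:
    """Fields a fresh entry would carry after a policy-driven reset."""
    return {
        "session_id": new_session_id(now),
        "created_at": now,
        "updated_at": now,
        "was_auto_reset": True,
        "auto_reset_reason": reason,
        "reset_had_activity": old_total_tokens > 0,
    }
```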

7. Restart Recovery Flow

The restart recovery system ensures that in-flight sessions are preserved across gateway restarts, crashes, and drain timeouts. It is the solution to issue #7536.

Startup Recovery Sequence

Gateway starts
       │
       ▼
┌───────────────────────────────┐
│ Check for .clean_shutdown     │── Exists? ──► Skip suspension (clean exit)
│ marker                        │
└───────────────────────────────┘
       │ Missing
       ▼
┌───────────────────────────────┐
│ session_store                 │── Marks sessions updated within
│ .suspend_recently_active()    │   last 120 seconds as resume_pending
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ _suspend_stuck_loop_sessions()│── Suspends sessions that have been
│                               │   active across 3+ restarts
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Queue inbound messages while  │
│ startup restore runs          │
│ (_startup_restore_in_progress)│
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ For each adapter, find        │
│ resume_pending sessions →     │
│ synthesize MessageEvent and   │
│ run _handle_message to let    │
│ the agent auto-continue       │
└───────────────────────────────┘

suspend_recently_active(max_age_seconds=120)

Called on gateway startup when no .clean_shutdown marker exists (indicating a crash or unexpected exit). For each session updated within the last 120 seconds:

  • Sets resume_pending=True, resume_reason="restart_interrupted", last_resume_marked_at=now.
  • Skips entries already resume_pending=True (no double-mark).
  • Skips entries explicitly suspended=True (hard wipe should stay).
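
The marking rules above can be sketched over an in-memory dict. Entry is a hypothetical reduced record, and returning the number of marked sessions is an assumption; the real method also persists the change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Entry:
    # Hypothetical subset of SessionEntry.
    updated_at: datetime
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[datetime] = None


def suspend_recently_active_sketch(entries: Dict[str, Entry], now: datetime,
                                   max_age_seconds: int = 120) -> int:
    cutoff = now - timedelta(seconds=max_age_seconds)
    marked = 0
    for entry in entries.values():
        if entry.suspended or entry.resume_pending:
            continue  # hard wipes stay; no double-marking
        if entry.updated_at >= cutoff:
            entry.resume_pending = True
            entry.resume_reason = "restart_interrupted"
            entry.last_resume_marked_at = now
            marked += 1
    return marked
```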

Stuck-Loop Detection (_suspend_stuck_loop_sessions)

Counts consecutive restarts via a JSON file ({HERMES_HOME}/restart_counts.json). If a session has been active across 3+ consecutive restarts, it's auto-suspended so the user gets a clean slate.
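
One way such a counter could work is sketched below. The file schema (a flat session_key to count mapping), the rule that a session missing from one restart loses its streak, and the name record_restart are all assumptions; the document only states that counts live in restart_counts.json and that the threshold is three consecutive restarts.

```python
import json
from pathlib import Path
from typing import Iterable, List


def record_restart(path: Path, active_keys: Iterable[str], threshold: int = 3) -> List[str]:
    """Bump the streak for each active session; return keys at or over the threshold."""
    previous = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Rebuilding the dict from active_keys drops sessions that were not active
    # this time, so only consecutive restarts accumulate.
    counts = {key: previous.get(key, 0) + 1 for key in active_keys}
    path.write_text(json.dumps(counts), encoding="utf-8")
    return sorted(key for key, count in counts.items() if count >= threshold)
```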

Drain-Timeout Marking

On graceful shutdown/restart, the drain system calls mark_resume_pending() for any session that was mid-turn when the drain timeout fired. Reasons:

  • "restart_timeout" — killed during restart drain
  • "shutdown_timeout" — killed during shutdown drain
  • "restart_interrupted" — crash recovery (from suspend_recently_active)

All three reasons are in _AUTO_RESUME_REASONS and eligible for startup auto-resume.

Auto-Resume on Next Access

When get_or_create_session() encounters resume_pending=True:

  1. It returns the existing entry without creating a new session_id.
  2. The existing transcript is loaded intact.
  3. The marking is not cleared here — it survives until the next successful turn completes (clear_resume_pending() is called from the gateway after run_conversation() returns a real response).
  4. If the resumed turn is interrupted again, the resume_pending flag remains set, and the next restart will retry. The stuck-loop counter handles terminal escalation (3 retries → suspended).

Clean Shutdown Marker (.clean_shutdown)

Written at the end of a graceful shutdown. On next startup:

  • If present: skip suspend_recently_active() entirely. Active agents were already drained, so no sessions are stuck.
  • Then delete the marker.

This prevents unwanted auto-resets after hermes update, hermes gateway restart, or /restart.
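
The check-then-delete handling of the marker can be sketched as follows. The marker's directory is not specified in this document, so the home argument and the function name consume_clean_shutdown_marker are assumptions.

```python
from pathlib import Path


def consume_clean_shutdown_marker(home: Path) -> bool:
    """Return True if the previous exit was clean, deleting the marker if found."""
    marker = home / ".clean_shutdown"
    if marker.exists():
        marker.unlink()  # one-shot: the next crash must not see a stale marker
        return True
    return False
```

A caller would run suspend_recently_active() only when this returns False.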


8. Message Queuing Flow

The message queuing system handles two scenarios:

  1. Interrupt follow-ups — When a user sends multiple messages while the agent is processing, subsequent messages are queued as single-slot pending messages.
  2. /queue FIFO — Explicit /queue commands that must each produce their own full agent turn, in order, without merging.

Data Structures

adapter._pending_messages: Dict[session_key, MessageEvent]
    └── Single "next-up" slot per session. Overwritten on repeat sends
        (burst collapse). Shared with photo-burst follow-ups.

self._queued_events: Dict[session_key, List[MessageEvent]]
    └── Overflow buffer. Each /queue invocation appends here when the
        slot is occupied. Promoted one-at-a-time after each drain.

Enqueue (_enqueue_fifo)

_enqueue_fifo(session_key, event, adapter)
       │
       ▼
┌───────────────────────────────────────┐
│ Is slot free?                         │
│ (session_key NOT in _pending_messages)│── Yes ──► Place event in slot
└───────────────────────────────────────┘
       │ No
       ▼
Append to _queued_events[session_key] (overflow tail)

Dequeue / Promotion (_promote_queued_event)

Called at the drain site after the slot was consumed. If there's an overflow item:

  • When pending_event is None (slot was empty), return overflow head as the new event.
  • When pending_event exists, stage overflow head in the slot for the next recursion.
  • If no adapter available, push back to _queued_events (don't silently drop).

Queue Depth

_queue_depth(session_key, adapter) returns len(overflow) + (1 if slot occupied else 0).
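
The slot-plus-overflow design can be modelled in a few lines. QueueSketch is a hypothetical class that uses strings in place of MessageEvent objects and merges slot consumption and promotion into a single drain_next() call; the real code splits these across the adapter and _promote_queued_event.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class QueueSketch:
    def __init__(self) -> None:
        self.pending: Dict[str, str] = {}                         # stands in for adapter._pending_messages
        self.overflow: Dict[str, List[str]] = defaultdict(list)   # stands in for _queued_events

    def enqueue_fifo(self, key: str, event: str) -> None:
        if key not in self.pending:
            self.pending[key] = event          # slot free: take it
        else:
            self.overflow[key].append(event)   # slot busy: append to overflow tail

    def drain_next(self, key: str) -> Optional[str]:
        """Consume the slot, then promote the overflow head."""
        event = self.pending.pop(key, None)
        if self.overflow.get(key):
            head = self.overflow[key].pop(0)
            if event is None:
                event = head                   # slot was empty: run the head now
            else:
                self.pending[key] = head       # stage the head for the next drain
        return event

    def depth(self, key: str) -> int:
        return len(self.overflow.get(key, [])) + (1 if key in self.pending else 0)
```

Enqueuing three events and draining repeatedly yields them in insertion order, one per drain, which is the FIFO behavior /queue relies on.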

Clearing

Queued events for a session are cleared on /new and /reset (via _handle_reset_command).

FIFO Invariant

Each /queue invocation produces exactly one full agent turn, in FIFO order, with no merging. The single-slot _pending_messages + overflow _queued_events design ensures that repeated sends during an active turn don't cause out-of-order processing.


9. Session Context Injection

SessionContext is built from a SessionSource and GatewayConfig and injected into the agent's system prompt. It tells the agent:

  • Where the current message came from
  • What platforms are connected
  • Where it can deliver scheduled task outputs
  • Whether this is a shared multi-user session

Construction (build_session_context)

def build_session_context(source, config, session_entry=None) -> SessionContext
  1. Collects connected platforms from config.
  2. Collects home channels for each platform.
  3. Determines shared_multi_user_session via is_shared_multi_user_session().
  4. Attaches session metadata (key, id, timestamps) if session_entry is provided.

PII Redaction (build_session_context_prompt)

The dynamic system prompt section (## Current Session Context) can optionally redact personally identifiable information before sending to the LLM:

  • User IDs → user_<12hex> (SHA-256 prefix)
  • Chat IDs → <platform>:<12hex> or just <12hex>
  • Platforms excluded from redaction: Discord (needs raw IDs for @mentions), and any plugin-registered platform not marked pii_safe.

Redaction applies only to the system prompt text. Routing, session keys, and adapter operations always use the original values.
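
The redacted forms can be illustrated with a short hashing sketch. It assumes an unsalted SHA-256 over the UTF-8 bytes of the ID and assumes the platform prefix, when present, is separated by a colon; redact_user_id and redact_chat_id are hypothetical names.

```python
import hashlib


def _digest12(value: str) -> str:
    # First 12 hex characters of the SHA-256 digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def redact_user_id(user_id: str) -> str:
    return f"user_{_digest12(user_id)}"


def redact_chat_id(chat_id: str) -> str:
    prefix, sep, rest = chat_id.partition(":")
    if sep:
        return f"{prefix}:{_digest12(rest)}"  # keep the platform prefix readable
    return _digest12(chat_id)
```

Hashing is deterministic, so the model still sees a stable pseudonym for each user across turns without seeing the raw identifier.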


10. Background Expiry Watcher

The _session_expiry_watcher task runs in the gateway event loop every 300 seconds (5 min).

Responsibilities

  1. Finalize expired sessions — For each entry where _is_session_expired() returns True and expiry_finalized is False:

    • Invoke on_session_finalize plugin hooks (cleanup, notifications).
    • Clean up cached AIAgent resources (close tool resources, shut down memory provider).
    • Evict the cached agent entry.
    • Clear per-session overrides (_session_model_overrides, reasoning overrides, etc.).
    • Mark expiry_finalized=True and persist.
  2. Sweep idle cached agents — Calls _sweep_idle_cached_agents() to evict agents that have been idle beyond _AGENT_CACHE_IDLE_TTL_SECS (3600s / 1h), regardless of session reset policy. This prevents unbounded memory growth in gateways with long-lived sessions.

  3. Prune stale entries — Calls session_store.prune_old_entries() hourly based on config.session_store_max_age_days. Prevents sessions.json from growing unbounded.

Failure Handling

  • Per-session retry count: each failed finalize is retried up to 3 consecutive times.
  • After 3 failures, the entry is force-marked expiry_finalized=True to prevent infinite retry loops.
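
The bounded-retry rule can be sketched as a helper that tells the watcher when to set expiry_finalized. The function name, the failures dict, and the callback signature are hypothetical.

```python
from typing import Callable, Dict

MAX_FINALIZE_ATTEMPTS = 3


def finalize_with_retry(key: str, finalize: Callable[[str], None],
                        failures: Dict[str, int]) -> bool:
    """Return True when the entry should be marked expiry_finalized."""
    try:
        finalize(key)
    except Exception:
        failures[key] = failures.get(key, 0) + 1
        # Give up after the third consecutive failure to avoid retrying forever.
        return failures[key] >= MAX_FINALIZE_ATTEMPTS
    failures.pop(key, None)  # success resets the streak
    return True
```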

11. Agent Cache

The gateway maintains an LRU cache of AIAgent instances keyed by session_key to preserve prompt caching across turns.

Cache Properties

  • Max size: 128 entries (_AGENT_CACHE_MAX_SIZE).
  • Eviction policy: Least-recently-used (LRU via OrderedDict).
  • Idle TTL: 3600s (1h) — enforced by _session_expiry_watcher.
  • Lock: _agent_cache_lock (threading) for thread safety.

Cache Lifecycle

Message arrives
    │
    ▼
get_or_create_session()  →  session_key obtained
    │
    ▼
Lookup _agent_cache[session_key]
    │
    ├── Hit → move_to_end(), reuse AIAgent (preserves prompt cache)
    │
    └── Miss → create new AIAgent, store in cache
                (if at capacity, popitem(last=False) evicts LRU entry)
    │
    ▼
run_conversation()  →  agent processes message
    │
    ▼
Session expiry watcher evicts agent when session finalizes

Cleanup Flow

When a session expires:

  1. _cleanup_agent_resources(agent) — shuts down memory provider, closes tool resources.
  2. _evict_cached_agent(key) — removes from _agent_cache so the agent can be GC'd.
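
The OrderedDict-based LRU behavior described in this section can be sketched as follows. AgentCacheSketch is a hypothetical class, the factory callback stands in for AIAgent construction, and the idle-TTL sweep is left out.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, Optional


class AgentCacheSketch:
    def __init__(self, max_size: int = 128) -> None:
        self._cache: "OrderedDict[str, Any]" = OrderedDict()
        self._lock = threading.Lock()
        self._max_size = max_size

    def get_or_create(self, key: str, factory: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._cache:
                self._cache.move_to_end(key)     # hit: mark as most recently used
                return self._cache[key]
            agent = factory()
            self._cache[key] = agent
            while len(self._cache) > self._max_size:
                self._cache.popitem(last=False)  # evict the least recently used entry
            return agent

    def evict(self, key: str) -> Optional[Any]:
        with self._lock:
            return self._cache.pop(key, None)
```

In the real gateway an evicted agent would first go through _cleanup_agent_resources(); this sketch simply drops the reference.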

Appendix: Key Configuration

| Config Key | Type | Default | Description |
|---|---|---|---|
| group_sessions_per_user | bool | true | Isolate group/channel sessions per user |
| thread_sessions_per_user | bool | false | Isolate thread sessions per user |
| session_store_max_age_days | int | 0 | Prune sessions older than N days (0=disabled) |
| agent.gateway_auto_continue_freshness | int | 3600 | Seconds for resume freshness window |
| agent.gateway_timeout | int | 1800 | Agent turn timeout (30 min default) |

Reset Policy (per-platform/type, in config.yaml)

session_reset:
  mode: both            # none | idle | daily | both
  at_hour: 4            # daily reset hour (local time)
  idle_minutes: 1440    # idle timeout (24h)
  notify: true          # notify user on auto-reset

Platform-specific overrides can be set under platforms.<name>.session_reset.