Session Lifecycle
Audience: Gateway developers and maintainers
Source files: gateway/session.py (~1444 lines), gateway/run.py (~16800 lines), gateway/config.py
Last updated: 2026-06-16
Overview
A session represents a continuous conversation between the agent and one or more users on a messaging platform. The session lifecycle governs when conversations persist, when they reset, how they survive gateway restarts, and how messages queue during concurrent operations.
The session system lives primarily in two modules:
- gateway/session.py: Data model (SessionSource, SessionEntry, SessionContext), key generation (build_session_key), and the main store (SessionStore).
- gateway/run.py: Gateway runner (GatewayRunner) that wires sessions into the message processing pipeline: session expiry watching, agent caching, restart recovery, and message queuing.
1. SessionSource — Message Origin Descriptor
SessionSource is a frozen record of where a message came from. It is attached to every
incoming MessageEvent and used for routing, isolation, and context injection.
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| platform | Platform | (required) | Enum identifying the messaging platform (telegram, discord, slack, signal, whatsapp, matrix, local, etc.). |
| chat_id | str | (required) | Platform-level chat/group/channel identifier. Routed through the adapter's chat_id_key transform. |
| chat_name | Optional[str] | None | Human-readable name of the chat or group. |
| chat_type | str | "dm" | One of "dm", "group", "channel", "thread". Controls session key generation and isolation. |
| user_id | Optional[str] | None | Platform-specific user identifier. Used for authorization and per-user session isolation. |
| user_name | Optional[str] | None | Display name of the message author. Injected into system prompt. |
| thread_id | Optional[str] | None | Forum topic / Discord thread / Slack thread identifier. Differentiates threaded conversations. |
| chat_topic | Optional[str] | None | Channel topic or description (Discord channel topic, Slack channel purpose). |
| user_id_alt | Optional[str] | None | Platform-specific stable alternative ID (Signal UUID, Feishu union_id). Used when user_id is ephemeral. |
| chat_id_alt | Optional[str] | None | Signal group internal ID; maps a Signal group V2 identifier to its canonical form. |
| is_bot | bool | False | True when the message author is a bot or webhook (Discord bots). |
| guild_id | Optional[str] | None | Discord guild / Slack workspace / Matrix server scope identifier. |
| parent_chat_id | Optional[str] | None | Parent channel when chat_id refers to a thread. |
| message_id | Optional[str] | None | ID of the triggering message. Used for pin/reply/react operations and Discord ID injection. |
| role_authorized | bool | False | True when the adapter granted access via a platform role (not an individual user ID). |
Key Methods
- description (property: str): Human-readable summary, e.g. "DM with Alice" or "group: My Group, thread: 12345".
- to_dict() / from_dict(): Serialization round-trip for persistence in sessions.json.
2. SessionEntry — Active Session Record
SessionEntry is the per-session metadata record stored in memory and persisted to
{sessions_dir}/sessions.json. Each entry maps a session_key to its current session_id.
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| session_key | str | (required) | Deterministic key identifying the conversation lane (see §4). |
| session_id | str | (required) | Unique identifier for this specific conversation incarnation. Format: YYYYMMDD_HHMMSS_<8hex>. |
| created_at | datetime | (required) | When this session incarnation was created. |
| updated_at | datetime | (required) | Last activity timestamp. Used for idle timeout and expiry checks. |
| origin | Optional[SessionSource] | None | The source that created this session, used for delivery routing. |
| display_name | Optional[str] | None | Chat display name (sourced from SessionSource.chat_name). |
| platform | Optional[Platform] | None | Platform enum, persisted for expiry policy lookup across restarts. |
| chat_type | str | "dm" | Chat type, also persisted for policy lookup. |
| input_tokens | int | 0 | Cumulative LLM input (prompt) tokens consumed. |
| output_tokens | int | 0 | Cumulative LLM output (completion) tokens consumed. |
| cache_read_tokens | int | 0 | Cumulative prompt cache read tokens. |
| cache_write_tokens | int | 0 | Cumulative prompt cache write tokens. |
| total_tokens | int | 0 | Total token count across all turns. |
| estimated_cost_usd | float | 0.0 | Estimated cumulative USD cost. |
| cost_status | str | "unknown" | Cost tracking status label. |
| last_prompt_tokens | int | 0 | Last API-reported prompt token count. Used for accurate compression pre-check. |
Boolean Flags (State Machine)
SessionEntry has several boolean flags that form a simple state machine governing session behavior on the next access.
| Flag | Type | Default | Description |
|---|---|---|---|
| was_auto_reset | bool | False | Set when a session was auto-reset due to policy expiry (idle/daily). Consumed once to inject a context notice. |
| auto_reset_reason | Optional[str] | None | "idle" or "daily": why the previous session was auto-reset. |
| reset_had_activity | bool | False | Whether the expired session had any messages (total_tokens > 0). |
| is_fresh_reset | bool | False | Set by explicit /new or /reset. Triggers topic/channel skill re-injection on first message. Distinguished from was_auto_reset to avoid misleading "session expired" notices. |
| expiry_finalized | bool | False | Set by the background expiry watcher after invoking on_session_finalize hooks, cleaning tool resources, and evicting the cached agent. Prevents redundant finalization across restarts. |
| suspended | bool | False | Hard force-wipe signal. Set by /stop or stuck-loop escalation (3+ consecutive restart failures). On next get_or_create_session(), forces a new session_id regardless of resume_pending. |
| resume_pending | bool | False | Soft recovery marker. Set by suspend_recently_active() (crash recovery) or drain timeout. On next access, preserves the existing session_id, so the user continues on the same transcript. Cleared after the next successful turn completes. |
| resume_reason | Optional[str] | None | Why resume was marked: "restart_timeout", "shutdown_timeout", "restart_interrupted". |
| last_resume_marked_at | Optional[datetime] | None | Timestamp of the last resume-pending marking. |
State Transition Logic (get_or_create_session)
      ┌──────────┐
      │ Incoming │
      │ Message  │
      └────┬─────┘
           │
           ▼
┌──────────────────────┐
│ session_key exists   │──── No ──► Create fresh SessionEntry
│ AND !force_new       │
└──────────┬───────────┘
           │ Yes
           ▼
┌──────────────────────┐
│ entry.suspended?     │──── Yes ──► Auto-reset: new session_id
└──────────┬───────────┘             (reason="suspended")
           │ No
           ▼
┌──────────────────────┐
│ entry.resume_pending?│──── Yes ──► Return existing entry
└──────────┬───────────┘             (preserve session_id)
           │ No                      Clear flag on next successful turn
           ▼
┌──────────────────────┐
│ Policy says reset?   │──── Yes ──► Auto-reset: new session_id
└──────────┬───────────┘             (reason="idle"/"daily")
           │ No
           ▼
┌──────────────────────┐
│ Return existing      │
│ entry, bump          │
│ updated_at           │
└──────────────────────┘
Priority order in get_or_create_session():
1. suspended=True → always force-reset (hard wipe)
2. resume_pending=True → preserve session_id (soft recovery)
3. Policy expiry (idle/daily) → auto-reset
4. No trigger → return existing entry (bump updated_at)
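The priority order above can be read as a small pure function. The following is a minimal sketch for illustration only, not the code in gateway/session.py: the `Entry` class, the `decide` function, and the `policy_reason` argument (standing in for the result of a policy check) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical stand-in carrying only the two flags that matter here."""
    suspended: bool = False
    resume_pending: bool = False


def decide(entry: Optional[Entry], policy_reason: Optional[str],
           force_new: bool = False) -> str:
    """Return which branch of the state machine applies.

    policy_reason is the assumed outcome of a policy check:
    "idle", "daily", or None when no reset is due.
    """
    if entry is None or force_new:
        return "create"
    if entry.suspended:
        return "reset:suspended"          # hard wipe always wins
    if entry.resume_pending:
        return "resume"                   # keep session_id, checked before policy
    if policy_reason is not None:
        return f"reset:{policy_reason}"   # idle/daily auto-reset
    return "reuse"                        # keep session_id, bump updated_at
```

Because the suspended check comes first, an entry carrying both flags is still wiped, which matches the note that suspended overrides resume_pending.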
3. SessionStore — Storage and Operations
SessionStore is the main storage layer. It maintains an in-memory dict (_entries) persisted
to sessions.json, with SQLite (SessionDB) as the canonical store for session metadata and
message transcripts.
Constructor
SessionStore(sessions_dir: Path, config: GatewayConfig, has_active_processes_fn=None)
- sessions_dir: Directory where sessions.json lives.
- config: GatewayConfig instance for reset policy lookups.
- has_active_processes_fn: Optional callback keyed by session_key to check for running background processes. Sessions with active processes are never expired or pruned.
Operations (Methods)
| Method | Description |
|---|---|
| get_or_create_session(source, force_new=False) | Core entry point. Returns existing or creates new SessionEntry. Evaluates suspended, resume_pending, and reset policy. Creates/ends SQLite records. |
| update_session(session_key, last_prompt_tokens=None) | Lightweight metadata update after an interaction. Bumps updated_at, optionally records last_prompt_tokens. |
| reset_session(session_key, display_name=None) | Explicit reset (from /new or /reset). Creates new session_id, sets is_fresh_reset=True. Ends old SQLite session, creates new one. |
| switch_session(session_key, target_session_id) | Switch to a different existing session ID (from /resume). Ends current SQLite session, reopens target. |
| suspend_session(session_key) | Mark session as suspended=True (from /stop). Forces auto-reset on next access. |
| mark_resume_pending(session_key, reason) | Mark session as resume_pending=True (from drain timeout). Preserves session_id on next access. Will NOT override suspended=True. |
| clear_resume_pending(session_key) | Clear resume_pending after a successful resumed turn. Called from gateway after run_conversation() returns. |
| suspend_recently_active(max_age_seconds=120) | Crash recovery: mark recently-active sessions as resume_pending=True. Skips already-pending and already-suspended entries. Called on startup after unclean shutdown. |
| prune_old_entries(max_age_days) | Drop entries older than max_age_days (based on updated_at). Skips suspended entries and sessions with active processes. |
| list_sessions(active_minutes=None) | Return all sessions, optionally filtered by recent activity. Sorted by updated_at descending. |
| lookup_by_session_id(session_id) | Find the active SessionEntry for a persisted session ID. |
| has_any_sessions() | Check if any sessions have ever been created (uses SQLite for history, not just the in-memory dict). |
| append_to_transcript(session_id, message, skip_db=False) | Append a message to the SQLite transcript. skip_db=True prevents duplicate writes when the agent already persisted. |
| rewrite_transcript(session_id, messages) | Full replacement of session transcript (used by /retry, /undo, /compress). |
| load_transcript(session_id) | Load all messages from a session's SQLite transcript. |
| rewind_session(session_id, n=1) | Back up n user turns via soft-delete (keeps audit trail). Returns {rewound_count, turns_undone, target_text}. |
Internal Helpers
- _ensure_loaded() / _ensure_loaded_locked(): Load sessions.json into the _entries dict.
- _save(): Atomic write to sessions.json via temp file + atomic_replace.
- _generate_session_key(source): Delegates to build_session_key() with config params.
- _is_session_expired(entry): Policy check from entry alone (no source needed). Used by the background expiry watcher.
- _should_reset(entry, source): Policy check returning "idle", "daily", or None.
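The temp-file-plus-replace pattern used by _save() can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the gateway's implementation: `save_atomically` is a hypothetical name, and the standard library's `os.replace` stands in for the project's atomic_replace helper, whose exact behavior is not described here.

```python
import json
import os
import tempfile
from pathlib import Path


def save_atomically(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then swap it into place."""
    # Same directory as the target so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())   # make sure bytes hit disk before the swap
        os.replace(tmp_name, path)      # assumed stand-in for atomic_replace
    except BaseException:
        # Do not leave a stray temp file behind if anything failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A reader of sessions.json then sees either the previous complete file or the new complete file, never a half-written one.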
Storage Layout
{sessions_dir}/
  sessions.json         # In-memory _entries dict, persisted as JSON
                        # Maps session_key → SessionEntry (metadata only)
  {session_id}.jsonl    # (Legacy, removed in spec 002)
The canonical transcript store is SQLite via SessionDB (from hermes_state). The
sessions.json file persists the session_key → session_id mapping and entry metadata
(flags, timestamps, token counts). If SQLite is unavailable, the store falls back to
JSONL, but this is a degradation path.
4. SessionKey Generation Rules
Session keys are deterministic strings that identify a conversation lane. They are generated
by build_session_key(source, group_sessions_per_user, thread_sessions_per_user).
Key Format
agent:main:{platform}:{chat_type}[:{chat_id}][:{thread_id}][:{participant_id}]
DM Rules
| Scenario | Key |
|---|---|
| DM with chat_id | agent:main:telegram:dm:12345 |
| DM with chat_id + thread | agent:main:telegram:dm:12345:thread_678 |
| DM without chat_id, with participant_id | agent:main:signal:dm:user_abc |
| DM without chat_id or participant_id | agent:main:telegram:dm |
| WhatsApp DM (canonicalized) | agent:main:whatsapp:dm:{canonical_number} |
- DMs always include chat_id when present, isolating each private conversation.
- thread_id further differentiates threaded DMs within the same DM chat.
- Without chat_id, the key falls back to user_id_alt or user_id as participant_id.
- Without any identifier, all DMs on that platform collapse to one shared session.
Group/Channel Rules
| Scenario | Key |
|---|---|
| Group chat | agent:main:telegram:group:-10012345 |
| Group chat, per-user isolation | agent:main:telegram:group:-10012345:user_abc |
| Thread in group, shared | agent:main:discord:group:12345:thread_678 |
| Thread in group, per-user | agent:main:discord:group:12345:thread_678:user_abc |
| Channel | agent:main:slack:channel:C12345 |
| WhatsApp group (canonicalized) | agent:main:whatsapp:group:{canonical_id}:{participant} |
- chat_id identifies the parent group/channel.
- thread_id differentiates threads within that parent.
- Per-user isolation (appending participant_id) is controlled by:
  - group_sessions_per_user (default: True): group/channel sessions are isolated.
  - thread_sessions_per_user (default: False): threads are shared by default (Telegram forum topics, Discord threads, Slack threads all share one session per thread).
- participant_id = user_id_alt or user_id (in that priority).
- WhatsApp identifiers are canonicalized to handle JID/LID alias flips.
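The DM and group rules above can be approximated in a few lines. This is a sketch under stated assumptions, not build_session_key() itself: `Source` is a hypothetical subset of SessionSource, the platform is a plain string rather than the Platform enum, WhatsApp canonicalization is omitted, and a DM with a thread_id but no chat_id is assumed to ignore the thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical subset of SessionSource fields used for key generation."""
    platform: str
    chat_type: str = "dm"
    chat_id: Optional[str] = None
    thread_id: Optional[str] = None
    user_id: Optional[str] = None
    user_id_alt: Optional[str] = None


def sketch_session_key(source: Source, group_sessions_per_user: bool = True,
                       thread_sessions_per_user: bool = False) -> str:
    parts = ["agent", "main", source.platform, source.chat_type]
    participant = source.user_id_alt or source.user_id   # alt ID takes priority

    if source.chat_type == "dm":
        if source.chat_id:
            parts.append(source.chat_id)
            if source.thread_id:
                parts.append(source.thread_id)
        elif participant:
            parts.append(participant)     # fallback when chat_id is missing
        return ":".join(parts)

    if source.chat_id:
        parts.append(source.chat_id)
    if source.thread_id:
        parts.append(source.thread_id)
    # Threads and plain groups consult different isolation switches.
    per_user = thread_sessions_per_user if source.thread_id else group_sessions_per_user
    if per_user and participant:
        parts.append(participant)
    return ":".join(parts)
```

With the defaults, two users in the same Discord thread map to one key, while two users in the same Telegram group map to different keys.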
Special Case: WhatsApp
WhatsApp phone numbers go through canonical_whatsapp_identifier() which strips the
@s.whatsapp.net suffix and normalizes to E.164 format. This prevents session fragmentation
when the bridge returns different alias forms of the same phone number.
5. Multi-User Isolation Strategy
Multi-user isolation determines whether multiple users in the same chat share a conversation or each get their own private session.
Decision Logic (is_shared_multi_user_session)
def is_shared_multi_user_session(source, *, group_sessions_per_user, thread_sessions_per_user):
    if source.chat_type == "dm":
        return False  # DMs are always private
    if source.thread_id:
        return not thread_sessions_per_user  # Threads: shared unless per-user
    return not group_sessions_per_user  # Groups: isolated unless shared
Summary
| Chat Type | Default | Config Control |
|---|---|---|
| DM | Private (never shared) | N/A |
| Group/Channel | Per-user isolation | group_sessions_per_user (default: True) |
| Thread (forum, discord) | Shared (all participants see same context) | thread_sessions_per_user (default: False) |
Impact on System Prompt
When shared_multi_user_session=True, the system prompt omits a fixed user name and instead
states: "Multi-user {thread|session} — messages are prefixed with [sender name]. Multiple
users may participate." Individual sender names are prefixed on each user message by the
gateway at runtime, preserving prompt caching (the system prompt doesn't change per-turn).
6. Reset Policy
Reset policies control when a session automatically loses context (gets a new session_id).
Policy Modes (SessionResetPolicy)
| Mode | Behavior | Default Config |
|---|---|---|
| "none" | Never auto-reset. Context managed only by compression. | N/A |
| "idle" | Reset after N minutes of inactivity from updated_at. | idle_minutes: 1440 (24h) |
| "daily" | Reset at a specific hour each day (local time). | at_hour: 4 (4 AM) |
| "both" | Whichever triggers first: daily boundary or idle timeout. | (default) |
Policy Evaluation
# Idle check
idle_deadline = entry.updated_at + timedelta(minutes=policy.idle_minutes)
if now > idle_deadline:
    return "idle"

# Daily check
today_reset = now.replace(hour=policy.at_hour, minute=0, second=0, microsecond=0)
if now.hour < policy.at_hour:
    today_reset -= timedelta(days=1)  # Today's boundary is still ahead; use yesterday's
if entry.updated_at < today_reset:
    return "daily"
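A self-contained version of this evaluation makes the boundary arithmetic easy to try out. It is a sketch, not the store's _should_reset(): `Policy` is a hypothetical stand-in for SessionResetPolicy, and gating each check on the mode string is an assumption about how "none", "idle", "daily", and "both" select the checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Policy:
    """Hypothetical stand-in for SessionResetPolicy."""
    mode: str = "both"          # "none" | "idle" | "daily" | "both"
    at_hour: int = 4
    idle_minutes: int = 1440


def reset_reason(updated_at: datetime, now: datetime, policy: Policy) -> Optional[str]:
    """Return "idle", "daily", or None, checking idle first as in the excerpt."""
    if policy.mode in ("idle", "both"):
        if now > updated_at + timedelta(minutes=policy.idle_minutes):
            return "idle"
    if policy.mode in ("daily", "both"):
        boundary = now.replace(hour=policy.at_hour, minute=0, second=0, microsecond=0)
        if now.hour < policy.at_hour:
            boundary -= timedelta(days=1)   # today's boundary is still ahead
        if updated_at < boundary:
            return "daily"
    return None
```

With the default policy, a session last touched at 23:30 and next accessed at 04:05 the following morning resets with reason "daily": only about 4.5 hours have passed, so the idle check does not fire, but the 04:00 boundary lies between the two timestamps.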
Per-Platform/Per-Type Policies
Reset policies are configurable per platform and session type via config.get_reset_policy().
This allows different platforms to have different expiry rules (e.g., Telegram DMs reset
after 24h idle, but Slack groups persist indefinitely).
Exclusions
Sessions with active background processes are never expired or reset. The
has_active_processes_fn callback checks for running processes when evaluating policies.
Reset Effects
When a reset triggers:
- Old session is ended in SQLite (with reason "session_reset").
- New session_id is generated (YYYYMMDD_HHMMSS_<8hex>).
- New SessionEntry is created with was_auto_reset=True and the reset reason.
- reset_had_activity is set if the old session had any turns (total_tokens > 0).
- The old AIAgent cache entry is evicted on the next expiry watcher pass.
- On the first message after reset, a context notice is injected: "Session expired due to inactivity / daily reset."
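For reference, an identifier in the YYYYMMDD_HHMMSS_<8hex> shape can be produced as shown below. This is only a sketch of the format: `new_session_id` is a hypothetical name, and drawing the suffix from `secrets.token_hex` is an assumption, since the source of the 8 hex characters is not specified here.

```python
import secrets
from datetime import datetime
from typing import Optional


def new_session_id(now: Optional[datetime] = None) -> str:
    """Build an ID shaped like YYYYMMDD_HHMMSS_<8hex>."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{stamp}_{secrets.token_hex(4)}"   # 4 random bytes give 8 hex chars
```

The timestamp prefix keeps IDs sortable by creation time, and the random suffix separates sessions created within the same second.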
7. Restart Recovery Flow
The restart recovery system ensures that in-flight sessions are preserved across gateway restarts, crashes, and drain timeouts. It is the solution to issue #7536.
Startup Recovery Sequence
Gateway starts
       │
       ▼
┌───────────────────────────────┐
│ Check for .clean_shutdown     │── Exists? ──► Skip suspension (clean exit)
│ marker                        │
└───────────────────────────────┘
       │ Missing
       ▼
┌───────────────────────────────┐
│ session_store                 │── Marks sessions updated within
│ .suspend_recently_active()    │   last 120 seconds as resume_pending
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ _suspend_stuck_loop_sessions()│── Suspends sessions that have been
│                               │   active across 3+ restarts
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ Queue inbound messages while  │
│ startup restore runs          │
│ (_startup_restore_in_progress)│
└───────────────────────────────┘
       │
       ▼
┌───────────────────────────────┐
│ For each adapter, find        │
│ resume_pending sessions →     │
│ synthesize MessageEvent and   │
│ run _handle_message to let    │
│ the agent auto-continue       │
└───────────────────────────────┘
suspend_recently_active(max_age_seconds=120)
Called on gateway startup when no .clean_shutdown marker exists (indicating a crash or
unexpected exit). For each session updated within the last 120 seconds:
- Sets resume_pending=True, resume_reason="restart_interrupted", last_resume_marked_at=now.
- Skips entries already resume_pending=True (no double-mark).
- Skips entries explicitly suspended=True (hard wipe should stay).
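The marking rules above amount to one pass over the entries. The sketch below illustrates that pass under stated assumptions: `Entry` is a hypothetical reduced record, `mark_recently_active` is an illustrative name, and persistence and locking are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional


@dataclass
class Entry:
    """Hypothetical reduced SessionEntry with only the recovery fields."""
    updated_at: datetime
    suspended: bool = False
    resume_pending: bool = False
    resume_reason: Optional[str] = None
    last_resume_marked_at: Optional[datetime] = None


def mark_recently_active(entries: Dict[str, Entry], now: datetime,
                         max_age_seconds: int = 120) -> int:
    """Mark recently active entries as resume_pending; return how many were marked."""
    cutoff = now - timedelta(seconds=max_age_seconds)
    marked = 0
    for entry in entries.values():
        if entry.updated_at < cutoff:
            continue                      # not active shortly before the crash
        if entry.suspended or entry.resume_pending:
            continue                      # hard wipe stays; no double-mark
        entry.resume_pending = True
        entry.resume_reason = "restart_interrupted"
        entry.last_resume_marked_at = now
        marked += 1
    return marked
```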
Stuck-Loop Detection (_suspend_stuck_loop_sessions)
Counts consecutive restarts via a JSON file ({HERMES_HOME}/restart_counts.json). If a
session has been active across 3+ consecutive restarts, it's auto-suspended so the user
gets a clean slate.
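The counting idea can be sketched without committing to a file layout. Everything here is an assumption for illustration: the structure of restart_counts.json is not described in this document, so the counts are modeled as a plain dict from session key to consecutive-restart count, and `bump_restart_counts` is a hypothetical helper name.

```python
from typing import Dict, Iterable, List, Tuple

STUCK_THRESHOLD = 3  # from the text: active across 3+ consecutive restarts


def bump_restart_counts(counts: Dict[str, int],
                        active_keys: Iterable[str]) -> Tuple[Dict[str, int], List[str]]:
    """Increment counts for sessions active at this restart; list the stuck ones.

    Sessions that were not active drop out of the result, which resets their
    streak, so only consecutive restarts accumulate.
    """
    updated = {key: counts.get(key, 0) + 1 for key in set(active_keys)}
    stuck = sorted(key for key, n in updated.items() if n >= STUCK_THRESHOLD)
    return updated, stuck
```

A caller would load the counts at startup, call the helper with the sessions that were mid-turn, suspend the returned keys, and write the updated counts back.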
Drain-Timeout Marking
On graceful shutdown/restart, the drain system calls mark_resume_pending() for any
session that was mid-turn when the drain timeout fired. Reasons:
- "restart_timeout": killed during restart drain
- "shutdown_timeout": killed during shutdown drain
- "restart_interrupted": crash recovery (from suspend_recently_active)
All three reasons are in _AUTO_RESUME_REASONS and eligible for startup auto-resume.
Auto-Resume on Next Access
When get_or_create_session() encounters resume_pending=True:
- It returns the existing entry without creating a new session_id.
- The existing transcript is loaded intact.
- The marking is not cleared here; it survives until the next successful turn completes (clear_resume_pending() is called from the gateway after run_conversation() returns a real response).
- If the resumed turn is interrupted again, the resume_pending flag remains set, and the next restart will retry. The stuck-loop counter handles terminal escalation (3 retries → suspended).
Clean Shutdown Marker (.clean_shutdown)
Written at the end of a graceful shutdown. On next startup:
- If present: skip suspend_recently_active() entirely. Active agents were already drained, so no sessions are stuck.
- Then delete the marker.
This prevents unwanted auto-resets after hermes update, hermes gateway restart,
or /restart.
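The check-then-delete handling of the marker could look like the sketch below. The directory that holds .clean_shutdown is not stated in this document, so the `home` parameter is an assumption, as is the helper name `consume_clean_shutdown_marker`.

```python
from pathlib import Path


def consume_clean_shutdown_marker(home: Path) -> bool:
    """Return True if the previous exit was clean, removing the marker either way.

    A True result means crash recovery (suspend_recently_active) can be skipped.
    """
    marker = home / ".clean_shutdown"
    if marker.exists():
        marker.unlink()     # one-shot: the next crash must not see a stale marker
        return True
    return False
```

Deleting the marker on read is what makes it safe: if the gateway later crashes, the next startup finds no marker and runs the recovery path.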
8. Message Queuing Flow
The message queuing system handles two scenarios:
- Interrupt follow-ups: When a user sends multiple messages while the agent is processing, subsequent messages are queued as single-slot pending messages.
- /queue FIFO: Explicit /queue commands that must each produce their own full agent turn, in order, without merging.
Data Structures
adapter._pending_messages: Dict[session_key, MessageEvent]
    └── Single "next-up" slot per session. Overwritten on repeat sends
        (burst collapse). Shared with photo-burst follow-ups.
self._queued_events: Dict[session_key, List[MessageEvent]]
    └── Overflow buffer. Each /queue invocation appends here when the
        slot is occupied. Promoted one-at-a-time after each drain.
Enqueue (_enqueue_fifo)
_enqueue_fifo(session_key, event, adapter)
        │
        ▼
┌───────────────────────────────────────┐
│ Is slot free?                         │
│ (session_key NOT in _pending_messages)│── Yes ──► Place event in slot
└───────────────────────────────────────┘
        │ No
        ▼
Append to _queued_events[session_key] (overflow tail)
Dequeue / Promotion (_promote_queued_event)
Called at the drain site after the slot was consumed. If there's an overflow item:
- When pending_event is None (slot was empty), return the overflow head as the new event.
- When pending_event exists, stage the overflow head in the slot for the next recursion.
- If no adapter is available, push the item back to _queued_events (don't silently drop it).
Queue Depth
_queue_depth(session_key, adapter) returns len(overflow) + (1 if slot occupied else 0).
Clearing
Queued events for a session are cleared on /new and /reset (via _handle_reset_command).
FIFO Invariant
Each /queue invocation produces exactly one full agent turn, in FIFO order, with no
merging. The single-slot _pending_messages + overflow _queued_events design ensures
that repeated sends during an active turn don't cause out-of-order processing.
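The slot-plus-overflow design can be modeled in a few lines to check the FIFO invariant. This is a simplified sketch, not the gateway code: `QueueSketch` and its method names are hypothetical, events are plain strings instead of MessageEvent objects, and the adapter and the burst-collapse overwrite path are left out.

```python
from typing import Dict, List, Optional


class QueueSketch:
    """Toy model of the single pending slot plus a FIFO overflow list."""

    def __init__(self) -> None:
        self.pending: Dict[str, str] = {}          # one "next-up" slot per session
        self.overflow: Dict[str, List[str]] = {}   # FIFO tail behind the slot

    def enqueue(self, key: str, event: str) -> None:
        if key not in self.pending:
            self.pending[key] = event              # slot free: take it
        else:
            self.overflow.setdefault(key, []).append(event)

    def depth(self, key: str) -> int:
        return len(self.overflow.get(key, [])) + (1 if key in self.pending else 0)

    def drain_one(self, key: str) -> Optional[str]:
        """Consume the slot, then promote the overflow head if there is one."""
        event = self.pending.pop(key, None)
        tail = self.overflow.get(key)
        if tail:
            head = tail.pop(0)
            if event is None:
                event = head                # slot was empty: run the head now
            else:
                self.pending[key] = head    # stage the head for the next drain
            if not tail:
                del self.overflow[key]
        return event
```

Enqueuing "a", "b", "c" for one session and draining three times yields them in that order, one per drain, which is the invariant stated above.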
9. Session Context Injection
SessionContext is built from a SessionSource and GatewayConfig and injected into the
agent's system prompt. It tells the agent:
- Where the current message came from
- What platforms are connected
- Where it can deliver scheduled task outputs
- Whether this is a shared multi-user session
Construction (build_session_context)
def build_session_context(source, config, session_entry=None) -> SessionContext
- Collects connected platforms from config.
- Collects home channels for each platform.
- Determines shared_multi_user_session via is_shared_multi_user_session().
- Attaches session metadata (key, id, timestamps) if session_entry is provided.
PII Redaction (build_session_context_prompt)
The dynamic system prompt section (## Current Session Context) can optionally redact
personally identifiable information before sending to the LLM:
- User IDs → user_<12hex> (SHA-256 prefix)
- Chat IDs → <platform>:<12hex> or just <12hex>
- Platforms excluded from redaction: Discord (needs raw IDs for @mentions), and any plugin-registered platform not marked pii_safe.
Redaction applies only to the system prompt text. Routing, session keys, and adapter operations always use the original values.
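The redacted forms listed above can be produced with the standard library's hashlib. The helper names below are hypothetical, and two details are assumptions: that chat IDs use the same SHA-256 prefix as user IDs, and that no salt is mixed in.

```python
import hashlib
from typing import Optional


def _short_hash(value: str) -> str:
    """First 12 hex characters of the SHA-256 digest (assumed: unsalted)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def redact_user_id(user_id: str) -> str:
    return f"user_{_short_hash(user_id)}"


def redact_chat_id(chat_id: str, platform: Optional[str] = None) -> str:
    hashed = _short_hash(chat_id)
    return f"{platform}:{hashed}" if platform else hashed
```

Because the hash is deterministic, the same user always appears under the same redacted token, so the model can still tell participants apart without seeing raw IDs.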
10. Background Expiry Watcher
The _session_expiry_watcher task runs in the gateway event loop every 300 seconds (5 min).
Responsibilities
- Finalize expired sessions: For each entry where _is_session_expired() returns True and expiry_finalized is False:
  - Invoke on_session_finalize plugin hooks (cleanup, notifications).
  - Clean up cached AIAgent resources (close tool resources, shut down memory provider).
  - Evict the cached agent entry.
  - Clear per-session overrides (_session_model_overrides, reasoning overrides, etc.).
  - Mark expiry_finalized=True and persist.
- Sweep idle cached agents: Calls _sweep_idle_cached_agents() to evict agents that have been idle beyond _AGENT_CACHE_IDLE_TTL_SECS (3600s / 1h), regardless of session reset policy. This prevents unbounded memory growth in gateways with long-lived sessions.
- Prune stale entries: Calls session_store.prune_old_entries() hourly based on config.session_store_max_age_days. Prevents sessions.json from growing unbounded.
Failure Handling
- Per-session retry count: each failed finalize is retried up to 3 consecutive times.
- After 3 failures, the entry is force-marked expiry_finalized=True to prevent infinite retry loops.
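The bounded-retry rule can be expressed as a small helper that the watcher would call once per pass for each expired entry. This is a sketch with hypothetical names (`try_finalize`, `failures`, `MAX_FINALIZE_FAILURES`); how the real watcher stores its counters is not described here.

```python
from typing import Callable, Dict

MAX_FINALIZE_FAILURES = 3  # from the text: give up after 3 consecutive failures


def try_finalize(key: str, finalize: Callable[[str], None],
                 failures: Dict[str, int]) -> bool:
    """Return True when the entry should be marked expiry_finalized."""
    try:
        finalize(key)
    except Exception:
        failures[key] = failures.get(key, 0) + 1
        if failures[key] < MAX_FINALIZE_FAILURES:
            return False            # leave unfinalized; retry on a later pass
        # Third consecutive failure: fall through and force-mark.
    failures.pop(key, None)         # success or give-up both clear the counter
    return True
```

Clearing the counter on success is what makes the limit apply to consecutive failures rather than to failures over the entry's lifetime.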
11. Agent Cache
The gateway maintains an LRU cache of AIAgent instances keyed by session_key to
preserve prompt caching across turns.
Cache Properties
- Max size: 128 entries (_AGENT_CACHE_MAX_SIZE).
- Eviction policy: Least-recently-used (LRU via OrderedDict).
- Idle TTL: 3600s (1h), enforced by _session_expiry_watcher.
- Lock: _agent_cache_lock (threading) for thread safety.
Cache Lifecycle
Message arrives
    │
    ▼
get_or_create_session() → session_key obtained
    │
    ▼
Lookup _agent_cache[session_key]
    │
    ├── Hit → move_to_end(), reuse AIAgent (preserves prompt cache)
    │
    └── Miss → create new AIAgent, store in cache
               (if at capacity, popitem(last=False) evicts LRU entry)
    │
    ▼
run_conversation() → agent processes message
    │
    ▼
Session expiry watcher evicts agent when session finalizes
Cleanup Flow
When a session expires:
1. _cleanup_agent_resources(agent): shuts down the memory provider and closes tool resources.
2. _evict_cached_agent(key): removes the entry from _agent_cache so the agent can be GC'd.
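The cache behavior described in this section (OrderedDict, move_to_end on hit, popitem(last=False) at capacity, a lock around access) can be sketched as below. `AgentCacheSketch` is a hypothetical class, agents are arbitrary objects produced by a caller-supplied factory, and the idle TTL sweep and resource cleanup are not modeled.

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, List, Optional


class AgentCacheSketch:
    """Toy LRU cache keyed by session_key."""

    def __init__(self, max_size: int = 128) -> None:
        self._items: "OrderedDict[str, Any]" = OrderedDict()
        self._lock = threading.Lock()
        self._max_size = max_size

    def get_or_create(self, key: str, factory: Callable[[], Any]) -> Any:
        with self._lock:
            if key in self._items:
                self._items.move_to_end(key)        # hit: mark most recently used
                return self._items[key]
            agent = factory()                       # simplification: built under the lock
            self._items[key] = agent
            while len(self._items) > self._max_size:
                self._items.popitem(last=False)     # evict least recently used
            return agent

    def evict(self, key: str) -> Optional[Any]:
        with self._lock:
            return self._items.pop(key, None)

    def keys(self) -> List[str]:
        with self._lock:
            return list(self._items)
```

Reusing the same agent object for a session is the point of the cache: a fresh agent would rebuild its prompt state and lose the provider-side prompt cache.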
Appendix: Key Configuration
| Config Key | Type | Default | Description |
|---|---|---|---|
| group_sessions_per_user | bool | true | Isolate group/channel sessions per user |
| thread_sessions_per_user | bool | false | Isolate thread sessions per user |
| session_store_max_age_days | int | 0 | Prune sessions older than N days (0=disabled) |
| agent.gateway_auto_continue_freshness | int | 3600 | Seconds for resume freshness window |
| agent.gateway_timeout | int | 1800 | Agent turn timeout (30 min default) |
Reset Policy (per-platform/type, in config.yaml)
session_reset:
  mode: both          # none | idle | daily | both
  at_hour: 4          # daily reset hour (local time)
  idle_minutes: 1440  # idle timeout (24h)
  notify: true        # notify user on auto-reset
Platform-specific overrides can be set under platforms.<name>.session_reset.