hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-22 10:32:00 +00:00

Author	SHA1	Message	Date
Teknium	680732c104	fix(gateway): never interrupt a busy session with an internal completion event (#49738 ) Async-delegation completions (delegate_task(background=true)) and background-process completions (terminal notify_on_complete) re-enter the originating session as internal MessageEvents. When the session was busy, _handle_active_session_busy_message treated them like a user TEXT message and the default busy_input_mode='interrupt' aborted the active turn (and sent a 'Interrupting current task' ack) — the opposite of the design invariant that a completion surfaces as a new turn only when idle. Short-circuit internal events to return False so the base adapter queues them silently (it already excludes internal events from debounce), cascading them as the next turn after the current one finishes.	2026-06-20 10:57:41 -07:00
kshitijk4poor	1fbf48d4ad	fix(compression): make in-place compaction durable + rotation-independent end-to-end Review (Codex + 3-agent parallel) found the first cut of in-place mode was incomplete: it only updated the system prompt, so the persisted transcript stayed 'full history + summary' and the next turn/resume reloaded the full history and immediately re-compacted (a loop), and every downstream layer that keyed off session-id rotation silently no-op'd. The session_id was doing double duty as the 'compaction happened' signal. This wires the whole path so removing rotation is actually complete: Agent (agent/conversation_compression.py): - In-place now DURABLY replaces the transcript: replace_messages(session_id, compressed) on the same row (the canonical store the gateway reloads from), not just update_system_prompt. Resume reloads the compacted set; no loop. - Reset flush identity/cursor (_last_flushed_db_idx=0, _flushed_db_message_ids cleared) so next-turn appends diff against the compacted transcript. - Expose a rotation-independent signal: agent._last_compaction_in_place, and in_place=True on the session:compress event. - Fire the compaction-boundary hooks (context-engine on_session_start, memory manager on_session_switch, reason='compression') in BOTH modes — in-place passes the same id as parent so DAG/buffer state still checkpoints. Without this, memory/context plugins miss every in-place compaction. Gateway auto-compress (gateway/run.py): - Read agent._last_compaction_in_place; set history_offset=0 on rotation OR in-place (both return the compacted set, so slicing past the pre-compaction length would drop everything). Carry compacted_in_place in the result dict. - No extra rewrite needed: the agent shares the gateway's SessionDB, so its replace_messages already updated the canonical store load_transcript reads. Manual /compress (gateway/slash_commands.py): - The throwaway /compress agent has no _session_db, so rewrite_transcript is the durable write. Previously gated behind 'if rotated:' which treated 'id unchanged' as the #44794 data-loss failure case and SKIPPED the rewrite — making /compress a silent no-op in in-place mode. Now rewrites on rotated OR in_place; the data-loss guard still fires only for the genuine no-rotation-AND-not-in-place failure. Hygiene auto-compress already writes _compressed to the same id unconditionally (its agent has no _session_db, can't rotate) — correct for in-place, no change. Tests (tests/run_agent/test_in_place_compaction.py): - Assert the DURABLE transcript IS the compacted set after reload (get_messages_as_conversation == compacted), message_count==2, flush identity reset, and the rotation-independent signal set on in-place / unset on rotation. Rotation regression guard unchanged. Verified: 64 tests green across in-place + rotation/persistence/boundary/ concurrent/failure-sync/command/cli suites; E2E both modes (durable replace, gateway offset=0, rotation preserves old transcript); ruff clean. Still default-off.	2026-06-20 10:57:07 -07:00
Teknium	5600105478	refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite, wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained bundled plugins under plugins/platforms/<x>/, discovered via the platform registry. Strips the per-platform core touchpoints from gateway/run.py, gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and tools/send_message_tool.py. Carries forward the migration fixes (explicit enabled:false honored, get_connected_platforms forces discovery, plugin is_connected via gateway.get_env_value, logs --component gateway matches plugins.platforms.*, matrix hidden on Windows). Additionally ports config keys main added since the PR base: the matrix plugin's _apply_yaml_config now also covers allowed_users, ignore_user_patterns, process_notices, and session_scope (the inline gateway/config.py matrix block gained these in the 1340 commits the PR sat open; they would otherwise have been silently dropped on deletion).	2026-06-20 10:26:45 -07:00
lkz-de	96db7c6883	fix(signal): preserve quoted reply context Carry Signal quote metadata through gateway events so replies to assistant messages include the quoted context without personalizing comments.	2026-06-20 15:16:53 +05:30
joaomarcos	75ed07ace8	fix(gateway): break the restart loop at the source on session resume When a tool call itself restarts the gateway (docker restart, systemctl restart, and similar), the process is terminated mid-call — before the tool result is persisted and before the orderly drain rewind can run. The transcript tail is left as an assistant(tool_calls) with no matching tool answer. On resume the model re-issues the unanswered call, taking the gateway down again — an infinite loop (#49201). Source fix: _build_gateway_agent_history now strips a trailing assistant(tool_calls) block that has no tool answers (_strip_dangling_tool_call_tail), so there is nothing for the model to re-execute. This complements _strip_interrupted_tool_tails, which only handles the case where a tool result row exists with an interrupt marker. Cognitive backstop: the resume-pending system note now states that any restart command in the history already ran and must not be re-executed or verified, and the empty-message auto-resume startup turn reports recovery and asks for instructions instead of the nonsensical "address the user's NEW message" (there is no new message on that turn). Reimplements the intent of #49243 by @JoaoMarcos44 at the replay layer. Fixes #49201	2026-06-19 16:59:58 -07:00
alt-glitch	f3e967aae5	fix(mcp): round-3 polish — generation capture adjacency + gateway contract note Third review pass (Hermes subagent) declared convergence: no BLOCKING, the round-2 generation-aware publish / context-engine staging / CLI reload / ACP routing all verified correct by hand and by test. - agent_init: capture _tool_snapshot_generation immediately before the tool snapshot (was ~425 lines earlier); removes a harmless skew window so the recorded generation always matches the snapshot it describes. - gateway/run.py _execute_mcp_reload: keep preserving each cached agent's build-time enabled_toolsets EXACTLY (do NOT merge newly-connected servers like CLI/TUI do) and document WHY — gateway sessions can be deliberately locked down, and test_reload_mcp_preserves_per_agent_toolset_overrides asserts this. A reviewer suggested "parity" here; it would have violated that contract.	2026-06-19 11:57:43 -07:00
alt-glitch	93d6e73028	fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway) MCP servers that connect after the agent's one-time tool snapshot were invisible for the whole session. Two root causes, fixed together: 1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers commonly take 2-6s on a cold connect, so they missed the window and their tools never entered the agent's snapshot. `thread.join(timeout)` already returns the instant discovery completes, so raising the bound costs ~0s for the common case (no MCP / fast servers) and only ever blocks for a genuinely-pending server, capped so a dead server can't freeze startup. The bound is now configurable via `mcp_discovery_timeout` (config.yaml, default 5.0s). 2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh thread), and the late-refresh detected changes by tool COUNT — missing an equal-size add/remove swap. Consolidated into one shared `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by tool NAME, mutates the agent under a lock (thread-safe), and respects the agent's own enabled/disabled toolsets. The late-binding refresh keeps its pre-first-turn cache-safety guard: it never rebuilds the tool list once a turn has started, so the cached prompt prefix is never invalidated mid-conversation. Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the name-based diff, in-place mutation, agent-scoped filtering, thread safety, and the config-driven discovery bound (incl. instant-return when nothing is pending). 75 passed across the touched areas.	2026-06-19 11:57:43 -07:00
Teknium	2a5e9d994a	Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)	2026-06-19 07:51:59 -07:00
Ben	1928aa0443	fix(managed-scope): honor managed scope in config→env bridges too Manual verification surfaced a second bypass class beyond the standalone config loaders: several code paths bridge config.yaml values into os.environ (HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*, network.force_ipv4, ...) by reading the raw user YAML, so the env the whole process reads carried the USER's value even when an administrator pinned it — e.g. a managed timezone was overridden because gateway/run.py wrote the user's timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var first. Wired the shared apply_managed_overlay() into every config→env bridge: - gateway/run.py module-level startup bridge (timezone, redact_secrets, max_turns, terminal, display, gateway.strict, ...) - gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn re-bridge that keeps config authoritative over reloaded .env — must keep MANAGED authoritative on every turn, not just startup) - hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge (runs before load_config is usable, at import time) - hermes_cli/send_cmd.py top-level scalar config→env bridge Verified end-to-end against a writable managed dir (12/12 checks incl. timezone, logging, model, skin, gateway settings, write-guard) and in a clean process the gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an order-independent regression test for the bridge overlay.	2026-06-19 07:46:33 -07:00
Ben	b0e47a98f9	fix(managed-scope): honor managed scope in all standalone config loaders The skin bug was one instance of a class: several subsystems build their config dict directly from config.yaml instead of routing through hermes_cli.config.load_config (which carries the managed merge), so they silently ignored administrator-pinned values. Audited every config.yaml reader and fixed the behavioral-read bypasses: - gateway/config.py load_gateway_config (messaging gateway: session_reset, quick_commands, stt, model, ...) - gateway/run.py _load_gateway_config (its read_raw_config fast path also skipped the merge — read_raw_config returns raw user YAML) - tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin, reasoning_effort, service_tier, provider_routing) - cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing) - hermes_logging.py (logging.level/max_size_mb/backup_count) - hermes_time.py (timezone) - hermes_cli/doctor.py (memory-provider diagnostic reads effective config) All route through a new shared managed_scope.apply_managed_overlay() helper that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't shadow a managed literal, root-model-string normalization, leaf-merge) and is fail-open. cli.py's earlier inline fix is refactored onto the same helper. Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile distribution) are deliberately left reading raw user YAML — overlaying managed values there would persist them into the user file. The dashboard (web_server.py) already routes through load_config and needed no change. TUI loader caches the RAW config so _save_cfg never writes managed values to disk. Adds test_managed_scope_overlay.py (helper) and test_managed_scope_loaders.py (per-surface integration); mutation-checked.	2026-06-19 07:46:33 -07:00
teknium1	a58287afcb	Merge remote-tracking branch 'origin/main' into pr48275-rebase # Conflicts: # cron/scheduler.py	2026-06-19 07:40:29 -07:00
Ben Barclay	1e70df5fdd	feat(gateway): multiplex phase 4 — lifecycle guard + per-profile observability - _guard_named_profile_under_multiplexer: when the default gateway is running with gateway.multiplex_profiles=on, a named-profile 'hermes gateway run' hard -errors (pointing at the multiplexer) instead of double-binding that profile's platforms. Inert unless all hold: this invocation is a named profile, a default-profile gateway is alive, and its config has multiplexing on. --force overrides. Wired into run_gateway's guard chain. - write_runtime_status gains served_profiles: the secondary-adapter startup records [active] + multiplexed profiles into runtime_status.json so 'hermes status' can show per-profile coverage without a second probe. Absent for single-profile gateways. Tests: served_profiles round-trips and is absent by default; guard is inert for the default profile / under --force / when no default gateway is running.	2026-06-19 07:34:15 -07:00
Ben Barclay	d5d02eabb0	feat(gateway): multiplex phase 3 — secondary-profile adapter registry + conflict detection Bring up adapters for every profile the gateway serves, not just the active one. Keeps self.adapters as the default/active profile's map (the ~93 existing self.adapters[...] sites are untouched) and adds secondary profiles under self._profile_adapters[profile][platform]. - _start_secondary_profile_adapters loops profiles_to_serve(multiplex=True), skips the active profile (handled by the primary startup loop), and for each other profile loads its gateway config and creates+connects its enabled adapters under that profile's _profile_runtime_scope (home + secret scope). - Each secondary adapter gets _make_profile_message_handler(profile): stamps source.profile (when unset) before delegating to the shared _handle_message, so the agent turn and session key resolve to that profile. - Same-platform credential-conflict detection: _adapter_credential_fingerprint hashes the adapter's bot token (salted, truncated — never logs the token); two profiles claiming the same (platform, token) refuse the duplicate with a clear error naming both, since one token can't be polled twice. - Port-binding hard-error: a SECONDARY profile that enables a port-binding platform (webhook, api_server, msgraph_webhook, feishu, wecom_callback, bluebubbles, sms) is a config error and aborts startup via MultiplexConfigError — the default profile owns the single shared HTTP listener and serves every profile through the /p/<profile>/ prefix, so a second bind can only collide. Distinct from a transient connect failure (which logs + stays alive to retry): a config error writes gateway_state=startup_failed and exits cleanly with an actionable message (names the profile, the platform, and the fix). There is no valid reason to bind a second port once you've opted into a multiplexer. - Shutdown tears down secondary adapters alongside the primary ones. - Defensive getattr guards keep partial-construction unit tests (stop(), _run_agent on bare instances) working. No-op when multiplex_profiles is off (self._profile_adapters stays empty). Tests: fingerprint stability/log-safety/distinctness, profile message-handler stamping (and not overriding an already-stamped source), port-binding hard-error raises + names the profile/platform, non-binding platform is not rejected, and the guard set covers every TCP-binding adapter.	2026-06-19 07:34:15 -07:00
Ben Barclay	f35abb122a	feat(gateway): multiplex phase 1 — HTTP-inbound /p/<profile>/ routing (webhook) Serve webhook inbound for multiple profiles off the one shared listener via a URL prefix, with no second port bound. - SessionSource gains a 'profile' field (round-trips through to_dict/from_dict; omitted when unset so existing serialization is unchanged). It carries which profile an inbound message was routed to. - WebhookAdapter registers /p/{profile}/webhooks/{route_name} alongside the existing /webhooks/{route_name}. _resolve_request_profile validates the prefix against profiles_to_serve(): None when absent or multiplexing is off (ignored, handled as default — no spurious 404), the profile name when valid, _PROFILE_REJECTED (→ 404) when the profile isn't served. The resolved profile is stamped onto the SessionSource. - session-key namespacing and the per-turn home/credential scope now prefer source.profile: SessionStore._resolve_profile_for_key(source), _session_key_for_source fallback, and _resolve_profile_home_for_source all honor it (→ the agent turn resolves that profile's config/skills/credentials via the Phase 2 _profile_runtime_scope). Constraint: routing inbound needs no per-profile platform credential, but the agent still needs the routed profile's provider key — delivered by Phase 2's secret scope. api_server (OpenAI-compatible surface) profile routing is a focused follow-on; its source-construction path differs from webhook's. Tests: SessionSource.profile round-trip + namespace drive; _resolve_request_ profile accept/reject/ignore matrix.	2026-06-19 07:34:15 -07:00
Ben Barclay	f538470cf4	feat(gateway): multiplex phase 2 — fail-closed profile credential isolation (Workstream A) The credential gate. When multiplexing is active, a profile's secrets resolve from a context-local scope, never the process-global os.environ (which in a multiplexer may hold another profile's keys, and is inherited by every subprocess spawned with env=dict(os.environ)). - agent/secret_scope.py: get_secret() backed by a secret-scope contextvar. FAIL-CLOSED: when multiplex is active and no scope is installed, an unscoped read RAISES UnscopedSecretError instead of falling back to os.environ — a missed/new call site crashes loudly at that line rather than leaking a cross-profile value. Genuinely-global vars (HERMES_*, PATH, kanban paths, …) keep reading os.environ via an allowlist. load_env_file/build_profile_ secret_scope parse a profile .env into an isolated dict WITHOUT mutating os.environ. Off by default => transparent os.getenv behavior. - hermes_cli/runtime_provider.py: all credential/provider/base-url reads go through _getenv -> get_secret. - agent/credential_pool.py: env fallbacks route through get_secret (the ~/.hermes/.env-first preference is preserved and already profile-correct via the home override). - tools/mcp_tool.py: MCP config interpolation resolves through get_secret, so a server's picks up the routed profile's value. - gateway/run.py: set_multiplex_active() at GatewayRunner init; per-turn .env reload is a no-op for credentials in multiplex mode (secrets come from the scope, not global env); _profile_runtime_scope context manager combines the HERMES_HOME override + secret scope; _run_agent wraps _run_agent_inner in that scope (resolved via _resolve_profile_home_for_source) when multiplexing. Propagates into the agent worker thread for free via the existing copy_context() in _run_in_executor_with_context. Tests: 13 unit (fail-closed, scope isolation, global allowlist, .env parsing without environ mutation) + 7 E2E (runtime_provider + MCP interpolation prove two profiles isolated, unscoped read raises, globals still read environ).	2026-06-19 07:34:15 -07:00
Ben Barclay	d82f9fa7f7	feat(gateway): multiplex phase 0 — config flag, profile enumeration, profile-stamped session keys Foundations for serving multiple profiles from one gateway process, inert when off: - gateway.multiplex_profiles config flag (default false), round-trips through GatewayConfig and load_gateway_config (top-level + nested gateway.* form). - hermes_cli.profiles.profiles_to_serve(multiplex): the single chokepoint for which (profile, HERMES_HOME) pairs the gateway serves. Lightweight dir scan; active-profile-only when off, default + all named profiles when on. - build_session_key gains a profile= namespace slot. Default/None reuse the historical 'agent:main:...' literal BYTE-IDENTICALLY (no session migration, positional parsers unaffected); a named profile becomes 'agent:<profile>:...' so two profiles on the same platform/chat never collide. - SessionStore._resolve_profile_for_key + _session_key_for_source fallback resolve the namespace from the flag (legacy when off, active profile when on). Tests: byte-identical-when-off (parametrized), namespace isolation, positional layout preserved, config round-trip, profiles_to_serve enumeration.	2026-06-19 07:34:15 -07:00
teknium1	df2420f571	fix(gateway): keep non-Discord home-channel startup send byte-identical The salvaged non_conversational marking made the home-channel startup no-metadata branch always pass metadata= explicitly; for non-Discord platforms _non_conversational_metadata returns None, so Telegram/etc. went from adapter.send(chat_id, message) to adapter.send(..., metadata=None). Behaviorally identical but broke test_restart_notification's exact assert_called_once_with. Only attach metadata when the marker applies (Discord), restoring the original call shape elsewhere.	2026-06-19 07:29:27 -07:00
snav	caaa916289	fix(gateway): don't let delayed Discord status messages partition history backfill Discord channel-history backfill partitions on Hermes' last self-authored message. Asynchronous, non-conversational status sends (self-improvement review bubbles, heartbeats, background-process notifications, update status, gateway restart/online notices) land as ordinary bot messages, so a delayed status bump becomes the history boundary and swallows real messages that arrived after Hermes' actual reply. Mark these sends at the source via metadata["non_conversational"] (Discord only; other platforms' metadata is unchanged). The adapter no longer advances the history-boundary cache for marked sends and persists their IDs to a sidecar JSON so the cold-start scan can skip them by ID after a restart. A narrow regex recognizer remains only as an upgrade bridge for status bumps emitted by an older gateway that pre-dates the marking.	2026-06-19 07:29:27 -07:00
infinitycrew39	ca92e9a362	fix(gateway): refresh cached agent max_iterations from current config When a gateway agent is reused from cache, it retains the max_iterations from its initial creation. If config.yaml agent.max_turns or HERMES_MAX_ITERATIONS changed between turns, the cached agent's budget becomes stale. Before reusing a cached agent, refresh agent.max_iterations from the freshly-resolved value (read from env/config at line 14585). Fixes partial issue from PR #48127: handles fresh agent creation + cached agent reuse.	2026-06-19 06:31:13 -07:00
infinitycrew39	460b1e50e5	fix(gateway): refresh max_turns before resolving runtime budget	2026-06-19 06:31:13 -07:00
Ben	637aff46e7	Merge remote-tracking branch 'origin/main' into hermes/hermes-6fe26723	2026-06-19 15:17:13 +10:00
Ben Barclay	2c6e266e88	fix(relay): trigger self-provision on relay-config + NAS token, not is_managed() (#48724 ) self_provision_if_managed() gated on is_managed(), but is_managed() means "NixOS/package-manager-managed" (it keys on HERMES_MANAGED or a ~/.hermes/.managed marker) — NOT "NAS-hosted". A NAS-provisioned Fly agent sets NEITHER, so the gate was always False and relay self-provision SILENTLY no-oped on exactly the hosted agents it was built for. Caught live: a staging agent with GATEWAY_RELAY_URL correctly stamped logged "No messaging platforms enabled" and never dialed the connector; HERMES_MANAGED was unset on the machine. The unit tests had mocked is_managed()->True, so they passed while the real trigger never fired (mocked- trigger blind spot). Fix: drop the is_managed() gate and rename self_provision_if_managed -> self_provision_relay. The real trigger is now "relay_url() set + no pinned secret + a resolvable NAS token", which is both NAS-independent and self-guarding: - NAS-hosted agent: GATEWAY_RELAY_URL + no pinned secret + bootstrapped NAS token -> self-provisions. - Self-hosted + `hermes gateway enroll`: pinned GATEWAY_RELAY_SECRET -> skipped (existing secret-present guard). - Self-hosted, unenrolled, no NAS identity: resolve_nous_access_token() fails -> graceful no-op (existing fail-soft path). Security: unchanged trust model. The connector still derives tenant from the validated NAS token; this only broadens WHEN the provision attempt fires, and every broadened case is still guarded by token-resolution + pinned-secret-skip. Tests: replaced the (wrong) "skips when not managed" test with a regression test proving a NAS host where is_managed()==False STILL provisions; renamed all call sites; added a "no NAS token -> non-fatal skip" test for the self-hosted branch. 88 relay tests pass. Relay-adapter lane. EXPERIMENTAL.	2026-06-19 01:01:24 +00:00
Ben	28531b6186	Merge remote-tracking branch 'origin/main' into hermes/hermes-6fe26723	2026-06-18 15:31:29 +10:00
Ben Barclay	0ddd21c74e	feat(relay): managed-boot self-provision client (Phase 3, gateway side) (#48242 ) The gateway half of relay Phase 3. On a MANAGED boot with relay configured and no secret pinned, the runtime self-provisions its relay credentials IN-PROCESS: resolve the agent's own Nous access token (resolve_nous_access_token) -> POST the connector's /relay/provision asserting its own endpoint + route keys -> set GATEWAY_RELAY_ID/SECRET/DELIVERY_KEY into os.environ so the immediately- following register_relay_adapter() reads them and dials out authenticated. No human, no enrollment token, no disk write — the creds live only in process memory (save_env_value refuses under managed anyway, and keeping the secret off any volume is the stronger posture). Stateless: process-env creds don't survive a restart, so a managed container re-provisions every boot; the connector's rotation window covers a still-connected prior instance. An explicitly-pinned GATEWAY_RELAY_SECRET is respected (skip). Self-hosted is unchanged: humans keep using `hermes gateway enroll`. Endpoint provenance is gateway-asserted (GATEWAY_RELAY_ENDPOINT + GATEWAY_RELAY_ROUTE_KEYS, env or gateway.relay_* config) — uniform code path whether the operator sets it (self-hosted) or NAS stamps it (hosted, the only case NAS knows the public URL). Both absent -> outbound-only provisioning (credentials, no inbound routes). The connector scopes the asserted endpoint to the verified tenant, so it stays within the security model. - gateway/relay/__init__.py: relay_endpoint(), relay_route_keys(), _provision_url(), _post_provision(), self_provision_if_managed() (never raises — a provision failure logs and boots without relay auth). - gateway/run.py: call self_provision_if_managed() immediately before register_relay_adapter() in the startup path. Tests: 12 unit (trigger logic, respect-pinned-secret, in-process env wiring, endpoint+routes vs outbound-only, fail-soft on token/connector failure); mutation-checked (drop is_managed guard / pinned-secret guard -> tests fail). Cross-repo live E2E driver lands on the connector side (depends on this). EXPERIMENTAL: relay auth scheme may change until >=2 Class-1 platforms validate.	2026-06-18 15:25:29 +10:00
Ben	abbd8646eb	feat(gateway,desktop): start cron via resolved CronScheduler provider Phase 3 — rebind both ticker call sites to resolve_cron_scheduler(). Default (built-in) path is byte-identical; Phase 0 characterization tests + the full gateway suite (6919) stay green. Task 3.1: split gateway/run.py _start_cron_ticker into: - _start_gateway_housekeeping() — the gateway-only chores (channel-dir refresh, image/doc cache cleanup, paste sweep, curator poll), now on their own loop/thread, independent of which cron provider is active. - _start_cron_ticker() — kept as a DEPRECATED shim that runs only the built-in InProcessCronScheduler().start(), preserving the symbol for hermes_cli/debug.py and the Phase 0 characterization test. Task 3.2: start_gateway() resolves the provider and runs provider.start() in the 'cron-scheduler' thread, plus a second 'gateway-housekeeping' thread; teardown sets the shared cron_stop, calls provider.stop(), joins both. Task 3.3: desktop _start_desktop_cron_ticker() swapped its inline tick loop for resolve_cron_scheduler().start() (no adapters/loop — desktop has none). The provider owns ONLY the cron tick (so an external scale-to-zero provider with no 60s loop fits); gateway housekeeping is decoupled from the cron trigger. Both threads share cron_stop. Verified: full tests/cron/ (453) + full tests/gateway/ (6919) green. Manual gateway smoke (Task 3.4) is operator-run, pending.	2026-06-18 14:14:53 +10:00
Ben	237fa7d29c	feat(gateway): register relay adapter from config; drop HERMES_GATEWAY_RELAY gate Wire the relay adapter into gateway startup and make activation config-driven instead of a dark-launch flag. - gateway/relay/__init__.py: replace relay_enabled()/HERMES_GATEWAY_RELAY with relay_url() (GATEWAY_RELAY_URL env or gateway.relay_url in config.yaml) — the same shape as gateway.proxy_url. register_relay_adapter() registers when a URL is configured and builds a live WebSocketRelayTransport; with no URL it's a no-op (direct/single-tenant deployments unaffected). force=True keeps the transport-less adapter for unit tests. relay_platform_identity() reads the hello platform/botId from GATEWAY_RELAY_PLATFORM/GATEWAY_RELAY_BOT_ID. - gateway/run.py: call register_relay_adapter() during GatewayRunner.start(), right after plugin discovery, so a configured connector relay is registered on every boot. Failures are logged, never block startup. This removes the dark-launch posture: the relay is on whenever it's configured, shipping the production end state rather than hiding it behind a flag.	2026-06-17 16:37:45 -07:00
teknium	36ae958473	feat(gateway): gate message timestamps behind opt-in (default off) Follow-up to salvaged PR #41633: the timestamp prefix injection was unconditional. Gate the in-context render behind gateway.message_timestamps.enabled (default false) at both the live-message and history-replay sites; timestamp metadata is still captured + persisted regardless so the toggle can be flipped on later. Add DEFAULT_CONFIG entry, docs, and gate tests.	2026-06-16 15:49:59 -07:00
Wolfram Ravenwolf	bd7fc8fdcd	feat(gateway): inject stable human-readable message timestamps Consolidates these related Amy fork patches: - 429830f39 feat(gateway): inject message timestamps into user messages for LLM context - 3c3d6fac0 fix: handle both ISO string and epoch float timestamps in history replay - 2874f7725 feat: human-friendly timestamp format with weekday and timezone name - 3735f4c8b fix: render gateway message timestamps once	2026-06-16 15:49:59 -07:00
Sierra (Hermes Agent)	01ae9b853e	fix(telegram): resolve replies to rich (sendRichMessage) messages Telegram does not echo a sendRichMessage's content back in reply_to_message (.text/.caption empty, .api_kwargs None), so replies to rich sends (briefings, the gateway's own rich finals) arrived with no quotable text and the [Replying to: ...] injection was skipped. Remember message_id -> text at send time in a best-effort JSON index (gateway/rich_sent_store.py), and recover it on inbound when text and caption are both empty. Best-effort and no-throw throughout: any failure degrades to prior behavior and never breaks a send or message. Salvaged from #47375 by @x1erra. Dropped the cross-platform run.py reply-prefix rewrite (out of scope; bloated every reply on every platform) and scrubbed a docstring reference to an out-of-repo script. Kept the inbound reply_to logging enrichment used to verify the fix.	2026-06-16 13:04:20 -07:00
Wolfram Ravenwolf	e76e7b5073	feat(hooks): session:compress event_callback for MemPalace sync	2026-06-16 11:45:36 -07:00
Wolfram Ravenwolf	16fc717091	fix(mattermost): harden delivery hygiene PROBLEM: Mattermost threads can become invalid or enormous, exposing two failure modes: internal scratch/reasoning/commentary displays could leak into persistent Mattermost threads via global display toggles, while rejected threaded user-visible replies could disappear unless every failed send fell back flat. A broad flat fallback would pollute channels with tool/status/progress noise. SOLUTION: Require explicit Mattermost platform opt-in for scratch displays, keep using the existing notify=True metadata marker for user-visible final text/media/file replies, and allow the Mattermost plugin adapter to flat-fallback only notify-worthy sends whose threaded POST failure looks like a broken root/thread. Keep tool/status/progress and other non-notify sends thread-strict. Add regression tests for display opt-in, notify-only broken-thread fallback, generic API failure suppression, and stream notify metadata. Verification: tests/gateway/test_mattermost.py tests/gateway/test_stream_consumer.py tests/gateway/test_stream_consumer_thread_routing.py tests/gateway/test_stream_consumer_fresh_final.py tests/gateway/test_stream_consumer_draft.py; tests/gateway/test_session_api.py tests/gateway/test_status_command.py tests/gateway/test_resume_command.py tests/hermes_cli/test_commands.py; py_compile touched gateway files; git diff --check. Session: Mattermost thread 6qg8e9dd1pd9pkhi74xyaa1mry, 2026-06-01.	2026-06-16 06:34:54 -07:00
teknium	6373aba80f	feat(gateway): rename to tool_progress_grouping, add config/docs/tests Follow-up to salvaged PR #41620: - Rename tool_progress_style -> tool_progress_grouping (clearer intent) - Add display.tool_progress_grouping to DEFAULT_CONFIG (accumulate default) - Document in messaging docs incl. 'separate is noisier, only where progress enabled' - Add resolver tests (default/global/override/invalid/case)	2026-06-16 05:49:24 -07:00
Wolfram Ravenwolf	fc956b9db6	feat: add tool_progress_style config (accumulate vs separate) Add display.tool_progress_style setting to control how tool progress messages are displayed in chat platforms: - 'accumulate' (default): Edit a single message with all tool calls (new v0.9.0 behavior) - 'separate': Send each tool call as its own message, interleaved with thinking messages (pre-v0.9 behavior, better readability) The setting participates in the per-platform display override system and can be set globally or per-platform. Files: gateway/display_config.py, gateway/run.py	2026-06-16 05:49:24 -07:00
Wolfram Ravenwolf	20b1f4f3fb	feat(memory): configurable background memory update notifications Background memory reviews now support three notification modes, configured via display.memory_notifications in config.yaml: off — no chat notification (still logged to stdout/HA log) on — generic '💾 Memory updated' (default, unchanged behavior) verbose — content preview with action indicators: 💾 Memory ➕ Hermes Repo liegt unter /config/amy/hermes-agent/... 💾 Memory ✏️ Updated repo path from claude-code to hermes-agent... 💾 Memory ➖ old entry about claude-code path... Previews are truncated to 120 chars for adds/replaces, 60 for removes. Each action gets its own line in verbose mode for readability. Files: run_agent.py, gateway/run.py	2026-06-16 05:45:40 -07:00
Teknium	5a0e0d35b9	fix(mattermost): preserve thread-local delivery hygiene Salvage the valid thread-routing pieces from #41640: - route Mattermost progress/status sends through metadata thread IDs - treat top-level Mattermost channel posts as thread roots for progress - preserve thread metadata through media/file sends - allow flat fallback only for final notify-worthy replies on confirmed broken roots Co-authored-by: Wolfram Ravenwolf <github.com@wolfram.ravenwolf.de>	2026-06-15 15:06:23 -07:00
Teknium	c66ecf0bc3	feat(delegation): async background subagents via delegate_task(background=true) (#40946 ) * feat(delegation): async background subagents via delegate_task(background=true) delegate_task(background=true) dispatches a subagent that runs in the background and returns a handle immediately, so the user and model keep working while it runs. The full result — plus the original task source — re-enters the conversation as a new turn when the subagent finishes, riding the same completion-queue rail as terminal background processes. - tools/async_delegation.py: daemon-executor registry, capacity cap, rich self-contained completion event pushed onto the shared process_registry.completion_queue (type='async_delegation'). - delegate_tool.py: background param + single-task dispatch branch; batch async rejected (v1). - process_registry.py: format_process_notification renders the rich task-source block (goal/context/toolsets/model/status/result). - gateway/run.py: dedicated _async_delegation_watcher drains + injects results into the originating session (idle + post-turn), session_key routing enrichment, shutdown interrupt of dangling delegations. - config: delegation.max_async_children (default 3). Reuses the existing idle-drain wiring rather than mutating a running agent loop, preserving message-role alternation and prompt-cache invariants. 13 targeted tests; CLI + gateway paths E2E-verified. * test(delegation): make async non-blocking tests environment-independent CI 'test (5)' flaked on a cold, 8-worker runner: the first delegate_task(background=true) call measured 2.27s of one-time setup (config load + child-agent construction + imports), tripping the elapsed < 1.0 wall-clock assertion. That assertion was testing setup overhead, not blocking. Replace the wall-clock thresholds with the real invariant: dispatch returns while the child is still gated (active_count == 1, completion queue empty), which a synchronous impl could not do. Keep only a loose 4s sanity backstop well under the runner's 5s gate. * fix(delegation): harden async background delegation Follow-up review fixes: - Detach background child from parent._active_children at dispatch — otherwise parent-turn interrupts (Ctrl+C, mid-turn steering), cache evicts (release_clients), and session close (/new) kill/close the detached subagent mid-run, defeating the point of background mode. Lifecycle is owned by the async registry's interrupt_fn. - Make the capacity check atomic with the record insert (TOCTOU: two concurrent dispatches could both pass active_count() and exceed the cap). - TUI dedup: key async_delegation events by delegation_id — the fallthrough keyed them all as ("", type), suppressing every completion after the first in the desktop/TUI status feed. - CLI /stop now interrupts running background delegations and /agents lists them (they live outside the process registry and were invisible). - Drop stray unbalanced ']' line from the re-injection block and the unused _ASYNC_DEFAULT import. Tests: detach-at-dispatch + concurrent-capacity race added (15 total in test_async_delegation.py); 137 delegate + 140 process-registry/notify/watch + 7 TUI dedup tests pass. * fix(delegation): harden async background completion drains	2026-06-15 13:33:12 -07:00
Teknium	3e7e9b24d4	fix: harden salvaged session and browser improvements Polish salvaged contributor work before PR review: - read browser inactivity timeout from config with documented fallback - skip redundant v10 trigram backfill before v11 FTS rebuild - show delegate_task goals safely in progress previews - show gateway status model/context without redundant token wording - wire gateway /sessions to shared session-listing helpers - map Ravenwolf author emails for release attribution Co-authored-by: Wolfram Ravenwolf <github.com@wolfram.ravenwolf.de> Co-authored-by: Amy Ravenwolf <amy@ravenwolf.de>	2026-06-15 07:46:34 -07:00
Teknium	be7c919bf9	fix(process): label background completion causes (#46659 ) Track why a background process finished and include that source in notify-on-complete messages so SIGTERM from process.kill, kill_all, backend loss, and ordinary exits are distinguishable.	2026-06-15 07:08:24 -07:00
kshitijk4poor	3bc4a2ff78	fix(gateway): re-baseline agent-cache message_count after each turn The #45966 cross-process coherence guard snapshots a session's on-disk message_count next to the cached agent and rebuilds the agent when the count changes. But the snapshot is taken at agent-BUILD time — before the turn writes its own user + assistant (+ tool) rows — and the cache entry is never rewritten on a reuse. So this process's OWN turn grows message_count, and the very next turn sees a mismatch and rebuilds the agent. That happens every turn, for every conversation, silently destroying the per-conversation prompt caching the cache exists to protect (AGENTS.md: prompt caching is sacred). Add _refresh_agent_cache_message_count(): after a turn completes and the agent has flushed its rows to the SessionDB, re-baseline the stored count to the now-current value. The guard then fires ONLY when a DIFFERENT process changes the transcript — preserving the #45966 fix while keeping the cache warm for normal single-process operation. Tests drive the real SessionDB + the real guard condition: 5 consecutive same-process turns now all REUSE the cached agent (0 before the fix); a cross-process append still invalidates; and the re-baseline is fail-safe (no DB, falsy session_id, raising probe, legacy 2-tuple, pending sentinel all no-op).	2026-06-14 22:58:55 +05:30
kyssta-exe	7f245b0035	fix(gateway): invalidate agent cache on cross-process session writes (#45966 ) (cherry picked from commit `6d0f79defe`)	2026-06-14 22:54:39 +05:30
Teknium	2c174bce24	fix(gateway): preserve new input on interrupted replay cleanup	2026-06-14 05:10:39 -07:00
Arnaud L	5191c1c2ce	fix(gateway): stop replaying interrupted tool-call tails and auto-continue notes Three changes to prevent infinite re-execution loops when a user sends a new message while long-running tools are executing: 1. Filter interrupted tool results in _build_gateway_agent_history: skip tool messages whose content contains [Command interrupted] or exit_code 130 — they represent partial execution, not valid results. 2. Don't replay auto-continue notes as user messages: detect gateway-injected [System note: ...] / [IMPORTANT: ...] prefixes and skip them in _build_gateway_agent_history so the LLM doesn't see 4+ messages from 'the user' telling it to finish old work. 3. Fix the wording: the system note now instructs the model to address the user's NEW message FIRST, IGNORE pending results, and NOT re-execute old tool calls. Closes #45230	2026-06-14 05:10:39 -07:00
Aldo	293c04fef6	fix(gateway): suppress exact silence tokens without mutating history	2026-06-14 03:25:08 -07:00
Teknium	10bad2faf1	fix(gateway): serialize startup auto-resume before inbound (#46074 ) Gateway startup now queues real inbound messages until restart-interrupted auto-resume turns have completed, preventing duplicate agents for the same session after a restart.	2026-06-14 03:21:06 -07:00
Teknium	723c2331bd	fix: make profile subprocess HOME policy explicit	2026-06-14 03:20:21 -07:00
Teknium	dc90ca4e17	fix(ssl): run CA guard during agent initialization	2026-06-13 21:14:32 -07:00
Teknium	af5b526472	fix(ssl): validate CA bundle paths before provider calls	2026-06-13 21:14:32 -07:00
chromalinx	a218a0f156	fix(agent,gateway,doctor): add SSL CA cert bundle fail-fast guard A stale certifi CA bundle after a partial `hermes update` used to crash the agent on the first outbound HTTPS call with a raw traceback and trap the gateway in a retry loop. This patch: * Adds `agent/errors.py` with a typed `SSLConfigurationError` * Adds `agent/ssl_guard.py` with a `verify_ca_bundle()` pre-flight that asserts the bundle exists, is non-trivial in size, and can build a working SSLContext. On macOS, it falls back to the system trust store when the bundle is empty but the system store is healthy (covers corporate proxies / MDM setups). * Wires the guard into `run_agent.py` and `gateway/run.py` right after the `hermes_bootstrap` import, inside a try/except so a bug in the guard itself can never prevent startup. * Adds a `SSL / CA Certificates` section to `hermes_cli doctor` so users can detect the failure with one command. * Adds unit tests covering the healthy, missing, empty, skip-env, and macOS-fallback paths. * Adds an RCA document describing the failure mode and the recovery path (`pip install -e .`). When the bundle is broken the user sees: \u26a0\ufe0f SSL certificate bundle issue detected. Run: pip install -e . `HERMES_SKIP_SSL_GUARD=1` disables the check for sandboxed environments that ship their own trust store.	2026-06-13 21:14:32 -07:00
kshitijk4poor	63097ee0d7	test(gateway): cover auto-resume full-path no-regression; clarify guard docstring The salvaged fix's two regression tests mock adapter.handle_message, so they only assert the pre-claimed sentinel is set/cleaned around a stub — they never drive the real dispatch chain. Add a full-path test that exercises _schedule_resume_pending_sessions -> _guarded_handle_message -> adapter.handle_message -> _process_message_background -> _handle_message and asserts the resumed session's agent runs EXACTLY ONCE: not zero (the pre-claim must not self-bounce the resume into a queued no-op) and not twice (the duplicate-agent bug #45456 the fix targets). Also assert no leaked sentinel and no orphaned pending event after the drain settles. Tighten the _guarded_handle_message docstring: on current main the real sentinel is taken over inside _handle_message (not _process_message_background), and note the `is _AGENT_PENDING_SENTINEL` guard only releases the slot we ourselves placed, never one a live run owns.	2026-06-13 23:39:35 +05:30
liuhao1024	6e2fd955ca	fix(gateway): claim session slot before auto-resume task to prevent duplicate agents When the gateway restarts and auto-resumes an interrupted session, an inbound message arriving in the window between `asyncio.create_task()` and the task's first await could spin up a second AIAgent for the same session. Both agents would then process messages concurrently, producing interleaved duplicate responses (#45456). Fix: set `_AGENT_PENDING_SENTINEL` in `_running_agents` immediately after the "already running" check, before creating the task. This closes the race window — any inbound message sees the slot as occupied and queues behind the auto-resume. A `_guarded_handle_message` wrapper ensures the pre-claimed sentinel is always released, even if `handle_message` raises before reaching `_process_message_background` (whose `finally` block handles normal cleanup). (cherry picked from commit `85150c976b`)	2026-06-13 23:36:51 +05:30

1 2 3 4 5 ...

1077 commits