hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-22 10:32:00 +00:00

Author	SHA1	Message	Date
Teknium	680732c104	fix(gateway): never interrupt a busy session with an internal completion event (#49738 ) Async-delegation completions (delegate_task(background=true)) and background-process completions (terminal notify_on_complete) re-enter the originating session as internal MessageEvents. When the session was busy, _handle_active_session_busy_message treated them like a user TEXT message and the default busy_input_mode='interrupt' aborted the active turn (and sent a 'Interrupting current task' ack) — the opposite of the design invariant that a completion surfaces as a new turn only when idle. Short-circuit internal events to return False so the base adapter queues them silently (it already excludes internal events from debounce), cascading them as the next turn after the current one finishes.	2026-06-20 10:57:41 -07:00
kshitijk4poor	1fbf48d4ad	fix(compression): make in-place compaction durable + rotation-independent end-to-end Review (Codex + 3-agent parallel) found the first cut of in-place mode was incomplete: it only updated the system prompt, so the persisted transcript stayed 'full history + summary' and the next turn/resume reloaded the full history and immediately re-compacted (a loop), and every downstream layer that keyed off session-id rotation silently no-op'd. The session_id was doing double duty as the 'compaction happened' signal. This wires the whole path so removing rotation is actually complete: Agent (agent/conversation_compression.py): - In-place now DURABLY replaces the transcript: replace_messages(session_id, compressed) on the same row (the canonical store the gateway reloads from), not just update_system_prompt. Resume reloads the compacted set; no loop. - Reset flush identity/cursor (_last_flushed_db_idx=0, _flushed_db_message_ids cleared) so next-turn appends diff against the compacted transcript. - Expose a rotation-independent signal: agent._last_compaction_in_place, and in_place=True on the session:compress event. - Fire the compaction-boundary hooks (context-engine on_session_start, memory manager on_session_switch, reason='compression') in BOTH modes — in-place passes the same id as parent so DAG/buffer state still checkpoints. Without this, memory/context plugins miss every in-place compaction. Gateway auto-compress (gateway/run.py): - Read agent._last_compaction_in_place; set history_offset=0 on rotation OR in-place (both return the compacted set, so slicing past the pre-compaction length would drop everything). Carry compacted_in_place in the result dict. - No extra rewrite needed: the agent shares the gateway's SessionDB, so its replace_messages already updated the canonical store load_transcript reads. Manual /compress (gateway/slash_commands.py): - The throwaway /compress agent has no _session_db, so rewrite_transcript is the durable write. Previously gated behind 'if rotated:' which treated 'id unchanged' as the #44794 data-loss failure case and SKIPPED the rewrite — making /compress a silent no-op in in-place mode. Now rewrites on rotated OR in_place; the data-loss guard still fires only for the genuine no-rotation-AND-not-in-place failure. Hygiene auto-compress already writes _compressed to the same id unconditionally (its agent has no _session_db, can't rotate) — correct for in-place, no change. Tests (tests/run_agent/test_in_place_compaction.py): - Assert the DURABLE transcript IS the compacted set after reload (get_messages_as_conversation == compacted), message_count==2, flush identity reset, and the rotation-independent signal set on in-place / unset on rotation. Rotation regression guard unchanged. Verified: 64 tests green across in-place + rotation/persistence/boundary/ concurrent/failure-sync/command/cli suites; E2E both modes (durable replace, gateway offset=0, rotation preserves old transcript); ruff clean. Still default-off.	2026-06-20 10:57:07 -07:00
Teknium	5600105478	refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite, wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained bundled plugins under plugins/platforms/<x>/, discovered via the platform registry. Strips the per-platform core touchpoints from gateway/run.py, gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and tools/send_message_tool.py. Carries forward the migration fixes (explicit enabled:false honored, get_connected_platforms forces discovery, plugin is_connected via gateway.get_env_value, logs --component gateway matches plugins.platforms.*, matrix hidden on Windows). Additionally ports config keys main added since the PR base: the matrix plugin's _apply_yaml_config now also covers allowed_users, ignore_user_patterns, process_notices, and session_scope (the inline gateway/config.py matrix block gained these in the 1340 commits the PR sat open; they would otherwise have been silently dropped on deletion).	2026-06-20 10:26:45 -07:00
kshitijk4poor	26d9a3c710	fix(signal): FIFO-evict the quote-detection timestamp cache `_sent_message_timestamps` (the reply-to-own-message quote cache) used a `set` evicted with `set.pop()`, which removes an ARBITRARY element — so once more than the cap (500) outbound timestamps are tracked, a still-recent timestamp could be dropped while older ones survive, missing a genuine reply-to-own-message. Convert it to an OrderedDict with FIFO (oldest-first) eviction, mirroring the recently-hardened echo ring (#31250). This closes the same bug class on the sibling cache. Adds a regression test asserting oldest-first eviction + MRU promotion.	2026-06-20 21:00:46 +05:30
w31rdm4ch1nZ	332f88f6a6	fix(signal): harden recently-sent echo ring with LRU + TTL	2026-06-20 20:50:52 +05:30
kshitijk4poor	32a97a20af	fix(signal): strip self-mention in all groups, not just require_mention Review follow-up on the salvaged self-mention strip (#31217): the original only stripped the bot's rendered @<number>/@<uuid> self-mention inside the `require_mention=true` branch, so groups with require_mention=false still leaked it into the agent text. Hoist the strip to run for every group message (fixing the whole bug class), and collapse the doubled space a mid-sentence removal leaves while preserving intentional newlines.	2026-06-20 16:27:28 +05:30
Kailigithub	40b6ac9ac7	fix(signal): send explicit stop-typing RPC when cancelling indicator	2026-06-20 16:23:41 +05:30
Rick Ratmansky	96b10327b6	fix(signal): strip bot self-mention from group messages before agent dispatch	2026-06-20 16:23:41 +05:30
lkz-de	96db7c6883	fix(signal): preserve quoted reply context Carry Signal quote metadata through gateway events so replies to assistant messages include the quoted context without personalizing comments.	2026-06-20 15:16:53 +05:30
kshitij	ff50a88617	Merge pull request #49558 from NousResearch/salvage/env-var-guards-48735	2026-06-20 15:11:54 +05:30
kshitijk4poor	a7dd98c860	fix(env): guard remaining malformed int/float env var casts with utils helpers Widen the env_float() guard from #48735 across the whole bug class: a non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd port) raised an unhandled ValueError and crashed adapter/agent init. Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to the canonical utils.env_int / utils.env_float helpers (the established house pattern), instead of duplicating per-module helpers or inline try/except: - gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT - gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL - gateway/platforms/feishu.py: dedup cache + text/media batch settings - gateway/platforms/wecom.py, discord/adapter.py: text batch delays - gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT - gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT - hermes_cli/auth.py: CODEX/XAI refresh timeouts - agent/chat_completion_helpers.py: API/stream read/stale timeouts - run_agent.py, agent/auxiliary_client.py: API + nous timeouts Sites already guarded by try/except or local helpers are left untouched. The HERMES_MAX_ITERATIONS sites are already guarded on main via _current_max_iterations(), so they are not included.	2026-06-20 14:54:36 +05:30
kshitijk4poor	abafba0762	refactor(signal): correct STT-fallback comment, type the markdown wrapper, make AAC test portable Review follow-up on the salvaged AAC + markdown changes: - Fix an inaccurate comment claiming the STT layer has a sniff-and-remux fallback (verified: no such fallback exists; the ffmpeg-absent path caches raw ADTS and STT may reject it). - Type the _markdown_to_signal wrapper as tuple[str, list[str]] to match the shared helper instead of a bare tuple. - Replace the hardcoded /home/pi/... test fixture with a runtime-generated ADTS AAC sample so the remux round-trip actually runs in CI (skips only when ffmpeg is absent) instead of always-skipping.	2026-06-20 14:24:29 +05:30
jasnoorgill	da34fca2bb	fix(signal): detect ADTS AAC voice notes and remux to MP4 Android Signal delivers voice notes as raw ADTS AAC frames, which share the `0xFF 0xFx` sync word with MPEG-1/2 Layer 3 (MP3). The `_guess_extension` byte-signature test in gateway/platforms/signal.py was matching both, so ADTS AAC was being misclassified as MP3 — saved to disk with the wrong extension and rejected by every major STT API (Groq, OpenAI) because their server-side format sniffers inspect the actual codec, not the file extension. Two changes: 1. Tighten the MP3 vs ADTS disambiguator. ADTS packs `ID`, `layer`, and `protection_absent` into bits 3-0 of byte 1, where `ID=0` and `layer=00` for AAC. Real MP3 has `ID=1` and `layer` in {01, 10, 11}. The mask `0xF6` against target `0xF0` cleanly separates them. 2. Remux raw ADTS AAC to MP4 container at the cache step via `ffmpeg -c:a copy`. Single demux/remux, no re-encode, no quality loss, sub-100ms on a Pi 5. The cached file is a normal `.m4a` that all major STT providers accept. ffmpeg is a transitive dependency of many other Hermes features (TTS, video skills) so this isn't a new install requirement; the remux degrades gracefully to a no-op if ffmpeg is missing. The new helper `_remux_aac_to_m4a` is unit-tested with a real Android voice note from the audio cache that originally triggered the bug, plus synthetic ADTS frames for the byte-level disambiguator and garbage-input graceful failure. Closes the gap that broke transcription for any Android Signal user sending voice messages to Hermes.	2026-06-20 13:48:05 +05:30
lkz-de	905820b59f	fix(signal): share markdown formatting across send paths Route Signal send paths through shared markdown formatting helpers and render markdown bullets consistently as Unicode bullets. Add coverage for Signal formatting and send_message integration.	2026-06-20 13:47:14 +05:30
helix4u	c253b07380	fix(model): clear stale endpoint credentials across switches	2026-06-19 19:58:26 -07:00
joaomarcos	75ed07ace8	fix(gateway): break the restart loop at the source on session resume When a tool call itself restarts the gateway (docker restart, systemctl restart, and similar), the process is terminated mid-call — before the tool result is persisted and before the orderly drain rewind can run. The transcript tail is left as an assistant(tool_calls) with no matching tool answer. On resume the model re-issues the unanswered call, taking the gateway down again — an infinite loop (#49201). Source fix: _build_gateway_agent_history now strips a trailing assistant(tool_calls) block that has no tool answers (_strip_dangling_tool_call_tail), so there is nothing for the model to re-execute. This complements _strip_interrupted_tool_tails, which only handles the case where a tool result row exists with an interrupt marker. Cognitive backstop: the resume-pending system note now states that any restart command in the history already ran and must not be re-executed or verified, and the empty-message auto-resume startup turn reports recovery and asks for instructions instead of the nonsensical "address the user's NEW message" (there is no new message on that turn). Reimplements the intent of #49243 by @JoaoMarcos44 at the replay layer. Fixes #49201	2026-06-19 16:59:58 -07:00
joaomarcos	3a6c171e9e	fix(gateway): log signal transport response and bubble cron live adapter errors	2026-06-19 16:59:38 -07:00
joaomarcos	5649b8649a	Fix silent delivery failures in Signal live adapter (#49260 )	2026-06-19 16:59:38 -07:00
Gille	a7983d5ad7	fix(dashboard): hide sidecar sessions from history (#49269 ) * fix(dashboard): hide sidecar sessions from history * test(dashboard): allow sidecar source in session payload	2026-06-19 18:06:38 -04:00
Evo	2fe78d1ae3	fix(gateway): persist inline-keyboard model-picker selections by default #49066 made /model text and the CLI picker persist to config.yaml by default, but the gateway (Telegram/Discord/Matrix) inline-keyboard picker callback stayed session-only. Mirror the text path's persist block so a tapped model survives across launches like a typed one.	2026-06-20 02:32:44 +05:30
kshitijk4poor	d4e7dd609d	refactor(windows): tidy managed-node resolver helpers Behavior-preserving cleanups on the managed-node resolver: - Hoist _candidate_node_command_names() out of the inner dir loop in find_hermes_node_executable (computed once, not per directory). - Drop redundant os.environ.copy() at the two with_hermes_node_path( os.environ.copy()) sites \u2014 the helper already copies os.environ when called with no argument (verified env-equivalent). - Add reciprocal keep-in-sync comments between iter_hermes_node_dirs() (hermes_constants.py) and hermesManagedNodePathEntries() (electron main.cjs), which mirror the same platform-ordering rule across the Python/Node boundary.	2026-06-20 02:12:16 +05:30
helix4u	7a7b56d498	fix(windows): prefer managed node for whatsapp and desktop	2026-06-20 02:00:37 +05:30
hakanpak	38f1a923af	fix(gateway): rename the Telegram topic from /title, not only auto-titles Auto-generated session titles already rename the Telegram forum topic via the title_callback path, but the /title command only wrote the session title to the database. On a Telegram topic lane the visible topic kept its auto-assigned name, so a user who ran /title to override it saw no change. Propagate the user-chosen title to the topic by calling the existing _schedule_telegram_topic_title_rename helper on a successful /title set. It already no-ops off Telegram topic lanes and when auto-rename is disabled.	2026-06-20 01:54:16 +05:30
alt-glitch	f3e967aae5	fix(mcp): round-3 polish — generation capture adjacency + gateway contract note Third review pass (Hermes subagent) declared convergence: no BLOCKING, the round-2 generation-aware publish / context-engine staging / CLI reload / ACP routing all verified correct by hand and by test. - agent_init: capture _tool_snapshot_generation immediately before the tool snapshot (was ~425 lines earlier); removes a harmless skew window so the recorded generation always matches the snapshot it describes. - gateway/run.py _execute_mcp_reload: keep preserving each cached agent's build-time enabled_toolsets EXACTLY (do NOT merge newly-connected servers like CLI/TUI do) and document WHY — gateway sessions can be deliberately locked down, and test_reload_mcp_preserves_per_agent_toolset_overrides asserts this. A reviewer suggested "parity" here; it would have violated that contract.	2026-06-19 11:57:43 -07:00
alt-glitch	93d6e73028	fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway) MCP servers that connect after the agent's one-time tool snapshot were invisible for the whole session. Two root causes, fixed together: 1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers commonly take 2-6s on a cold connect, so they missed the window and their tools never entered the agent's snapshot. `thread.join(timeout)` already returns the instant discovery completes, so raising the bound costs ~0s for the common case (no MCP / fast servers) and only ever blocks for a genuinely-pending server, capped so a dead server can't freeze startup. The bound is now configurable via `mcp_discovery_timeout` (config.yaml, default 5.0s). 2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh thread), and the late-refresh detected changes by tool COUNT — missing an equal-size add/remove swap. Consolidated into one shared `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by tool NAME, mutates the agent under a lock (thread-safe), and respects the agent's own enabled/disabled toolsets. The late-binding refresh keeps its pre-first-turn cache-safety guard: it never rebuilds the tool list once a turn has started, so the cached prompt prefix is never invalidated mid-conversation. Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the name-based diff, in-place mutation, agent-scoped filtering, thread safety, and the config-driven discovery bound (incl. instant-return when nothing is pending). 75 passed across the touched areas.	2026-06-19 11:57:43 -07:00
Teknium	26e76a75e5	feat(telegram): opt-in Online/Offline bot status indicator (#49134 ) Sets the Telegram bot's short description (the line under its name) to "Online" on gateway connect and "Offline" on clean disconnect, gated behind extra.status_indicator (off by default). Telegram bots have no presence/online dot — that's a user-account feature the Bot API doesn't expose for bots. The short description is the closest available surface, so this gives users a way to tell whether the gateway is up from the bot's profile. - New extra.status_indicator flag (+ status_online/status_offline text overrides), read in __init__ via config.extra — no config-schema change. - _set_status_indicator() helper: best-effort, swallows API errors so it never blocks connect/disconnect; truncates to Telegram's 120-char cap. - Wired Online after _mark_connected(), Offline at top of disconnect() while the bot HTTP client is still alive. - 9 unit tests + Telegram docs section. Requested by @ilTrumpista, cc @Teknium.	2026-06-19 11:38:39 -07:00
Teknium	2a5e9d994a	Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)	2026-06-19 07:51:59 -07:00
Ben	1928aa0443	fix(managed-scope): honor managed scope in config→env bridges too Manual verification surfaced a second bypass class beyond the standalone config loaders: several code paths bridge config.yaml values into os.environ (HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*, network.force_ipv4, ...) by reading the raw user YAML, so the env the whole process reads carried the USER's value even when an administrator pinned it — e.g. a managed timezone was overridden because gateway/run.py wrote the user's timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var first. Wired the shared apply_managed_overlay() into every config→env bridge: - gateway/run.py module-level startup bridge (timezone, redact_secrets, max_turns, terminal, display, gateway.strict, ...) - gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn re-bridge that keeps config authoritative over reloaded .env — must keep MANAGED authoritative on every turn, not just startup) - hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge (runs before load_config is usable, at import time) - hermes_cli/send_cmd.py top-level scalar config→env bridge Verified end-to-end against a writable managed dir (12/12 checks incl. timezone, logging, model, skin, gateway settings, write-guard) and in a clean process the gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an order-independent regression test for the bridge overlay.	2026-06-19 07:46:33 -07:00
Ben	b0e47a98f9	fix(managed-scope): honor managed scope in all standalone config loaders The skin bug was one instance of a class: several subsystems build their config dict directly from config.yaml instead of routing through hermes_cli.config.load_config (which carries the managed merge), so they silently ignored administrator-pinned values. Audited every config.yaml reader and fixed the behavioral-read bypasses: - gateway/config.py load_gateway_config (messaging gateway: session_reset, quick_commands, stt, model, ...) - gateway/run.py _load_gateway_config (its read_raw_config fast path also skipped the merge — read_raw_config returns raw user YAML) - tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin, reasoning_effort, service_tier, provider_routing) - cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing) - hermes_logging.py (logging.level/max_size_mb/backup_count) - hermes_time.py (timezone) - hermes_cli/doctor.py (memory-provider diagnostic reads effective config) All route through a new shared managed_scope.apply_managed_overlay() helper that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't shadow a managed literal, root-model-string normalization, leaf-merge) and is fail-open. cli.py's earlier inline fix is refactored onto the same helper. Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile distribution) are deliberately left reading raw user YAML — overlaying managed values there would persist them into the user file. The dashboard (web_server.py) already routes through load_config and needed no change. TUI loader caches the RAW config so _save_cfg never writes managed values to disk. Adds test_managed_scope_overlay.py (helper) and test_managed_scope_loaders.py (per-surface integration); mutation-checked.	2026-06-19 07:46:33 -07:00
teknium1	a58287afcb	Merge remote-tracking branch 'origin/main' into pr48275-rebase # Conflicts: # cron/scheduler.py	2026-06-19 07:40:29 -07:00
Teknium	ba50e86563	fix: open dispatcher lock file with explicit utf-8 encoding ruff (unspecified-encoding) and the Windows-footgun checker both flag open() in text mode without encoture=. Keep text mode (the Windows lock path in _try_acquire_file_lock writes a str newline) and pass encoding='utf-8'.	2026-06-19 07:35:33 -07:00
Sahil Saghir	226e9322e1	fix(kanban): cross-platform dispatcher lock + explicit release Two robustness gaps from community review (#44919): 1. Windows dead-path: replaced bespoke fcntl.flock with gateway.status _try_acquire_file_lock / _release_file_lock — already cross-platform (msvcrt on Windows, fcntl on POSIX). Added _release_singleton_lock helper. 2. Lock fd never released: stored handle is now released explicitly in both exit paths — CancelledError handler and normal while-loop exit. Allows in-process stop/restart (tests, embedded use). Also tightened docstrings — 'corrupt the SQLite DBs' is now specific (wal_autocheckpoint=0 + concurrent manual WAL checkpoints can corrupt index pages), matching the module's own concurrency claims.	2026-06-19 07:35:33 -07:00
Sahil Saghir	dfa561092a	fix(kanban): machine-global singleton lock for the embedded dispatcher (#41448 ) The gateway's embedded dispatcher has no guard against more than one dispatcher running concurrently. dispatch_in_gateway defaults to true, so a second gateway for the same profile (a restart race where the old process is slow to exit) — or any deployment that runs multiple profile gateways with the default — starts a second dispatcher loop. As #41448 describes, concurrent dispatchers each run release_stale_claims() against the same boards, double reclaim frequency, and re-dispatch slow workers before they finish. In practice they also corrupt the shared kanban SQLite DBs under concurrent write load. Add _acquire_singleton_lock(): an exclusive, non-blocking fcntl.flock at the machine-global kanban root (kanban_home()/kanban/.dispatcher.lock — the board is shared across profiles by design, so this serialises every gateway, not just one profile). The first gateway to start its dispatcher holds the lock for its process lifetime; any other gateway finds it contended, logs, and skips dispatching while still running for messaging. Falls back to config-only control on non-POSIX or filesystems without flock. This is more robust than a per-profile guard because the documented model is "one dispatcher sweeps all boards" — the contention is across profiles, not just within one. Closes #41448. Test: lock is exclusive (held, then contended while held, then held again after release).	2026-06-19 07:35:33 -07:00
Ben Barclay	1e70df5fdd	feat(gateway): multiplex phase 4 — lifecycle guard + per-profile observability - _guard_named_profile_under_multiplexer: when the default gateway is running with gateway.multiplex_profiles=on, a named-profile 'hermes gateway run' hard -errors (pointing at the multiplexer) instead of double-binding that profile's platforms. Inert unless all hold: this invocation is a named profile, a default-profile gateway is alive, and its config has multiplexing on. --force overrides. Wired into run_gateway's guard chain. - write_runtime_status gains served_profiles: the secondary-adapter startup records [active] + multiplexed profiles into runtime_status.json so 'hermes status' can show per-profile coverage without a second probe. Absent for single-profile gateways. Tests: served_profiles round-trips and is absent by default; guard is inert for the default profile / under --force / when no default gateway is running.	2026-06-19 07:34:15 -07:00
Ben Barclay	d5d02eabb0	feat(gateway): multiplex phase 3 — secondary-profile adapter registry + conflict detection Bring up adapters for every profile the gateway serves, not just the active one. Keeps self.adapters as the default/active profile's map (the ~93 existing self.adapters[...] sites are untouched) and adds secondary profiles under self._profile_adapters[profile][platform]. - _start_secondary_profile_adapters loops profiles_to_serve(multiplex=True), skips the active profile (handled by the primary startup loop), and for each other profile loads its gateway config and creates+connects its enabled adapters under that profile's _profile_runtime_scope (home + secret scope). - Each secondary adapter gets _make_profile_message_handler(profile): stamps source.profile (when unset) before delegating to the shared _handle_message, so the agent turn and session key resolve to that profile. - Same-platform credential-conflict detection: _adapter_credential_fingerprint hashes the adapter's bot token (salted, truncated — never logs the token); two profiles claiming the same (platform, token) refuse the duplicate with a clear error naming both, since one token can't be polled twice. - Port-binding hard-error: a SECONDARY profile that enables a port-binding platform (webhook, api_server, msgraph_webhook, feishu, wecom_callback, bluebubbles, sms) is a config error and aborts startup via MultiplexConfigError — the default profile owns the single shared HTTP listener and serves every profile through the /p/<profile>/ prefix, so a second bind can only collide. Distinct from a transient connect failure (which logs + stays alive to retry): a config error writes gateway_state=startup_failed and exits cleanly with an actionable message (names the profile, the platform, and the fix). There is no valid reason to bind a second port once you've opted into a multiplexer. - Shutdown tears down secondary adapters alongside the primary ones. - Defensive getattr guards keep partial-construction unit tests (stop(), _run_agent on bare instances) working. No-op when multiplex_profiles is off (self._profile_adapters stays empty). Tests: fingerprint stability/log-safety/distinctness, profile message-handler stamping (and not overriding an already-stamped source), port-binding hard-error raises + names the profile/platform, non-binding platform is not rejected, and the guard set covers every TCP-binding adapter.	2026-06-19 07:34:15 -07:00
Ben Barclay	f35abb122a	feat(gateway): multiplex phase 1 — HTTP-inbound /p/<profile>/ routing (webhook) Serve webhook inbound for multiple profiles off the one shared listener via a URL prefix, with no second port bound. - SessionSource gains a 'profile' field (round-trips through to_dict/from_dict; omitted when unset so existing serialization is unchanged). It carries which profile an inbound message was routed to. - WebhookAdapter registers /p/{profile}/webhooks/{route_name} alongside the existing /webhooks/{route_name}. _resolve_request_profile validates the prefix against profiles_to_serve(): None when absent or multiplexing is off (ignored, handled as default — no spurious 404), the profile name when valid, _PROFILE_REJECTED (→ 404) when the profile isn't served. The resolved profile is stamped onto the SessionSource. - session-key namespacing and the per-turn home/credential scope now prefer source.profile: SessionStore._resolve_profile_for_key(source), _session_key_for_source fallback, and _resolve_profile_home_for_source all honor it (→ the agent turn resolves that profile's config/skills/credentials via the Phase 2 _profile_runtime_scope). Constraint: routing inbound needs no per-profile platform credential, but the agent still needs the routed profile's provider key — delivered by Phase 2's secret scope. api_server (OpenAI-compatible surface) profile routing is a focused follow-on; its source-construction path differs from webhook's. Tests: SessionSource.profile round-trip + namespace drive; _resolve_request_ profile accept/reject/ignore matrix.	2026-06-19 07:34:15 -07:00
Ben Barclay	f538470cf4	feat(gateway): multiplex phase 2 — fail-closed profile credential isolation (Workstream A) The credential gate. When multiplexing is active, a profile's secrets resolve from a context-local scope, never the process-global os.environ (which in a multiplexer may hold another profile's keys, and is inherited by every subprocess spawned with env=dict(os.environ)). - agent/secret_scope.py: get_secret() backed by a secret-scope contextvar. FAIL-CLOSED: when multiplex is active and no scope is installed, an unscoped read RAISES UnscopedSecretError instead of falling back to os.environ — a missed/new call site crashes loudly at that line rather than leaking a cross-profile value. Genuinely-global vars (HERMES_*, PATH, kanban paths, …) keep reading os.environ via an allowlist. load_env_file/build_profile_ secret_scope parse a profile .env into an isolated dict WITHOUT mutating os.environ. Off by default => transparent os.getenv behavior. - hermes_cli/runtime_provider.py: all credential/provider/base-url reads go through _getenv -> get_secret. - agent/credential_pool.py: env fallbacks route through get_secret (the ~/.hermes/.env-first preference is preserved and already profile-correct via the home override). - tools/mcp_tool.py: MCP config interpolation resolves through get_secret, so a server's picks up the routed profile's value. - gateway/run.py: set_multiplex_active() at GatewayRunner init; per-turn .env reload is a no-op for credentials in multiplex mode (secrets come from the scope, not global env); _profile_runtime_scope context manager combines the HERMES_HOME override + secret scope; _run_agent wraps _run_agent_inner in that scope (resolved via _resolve_profile_home_for_source) when multiplexing. Propagates into the agent worker thread for free via the existing copy_context() in _run_in_executor_with_context. Tests: 13 unit (fail-closed, scope isolation, global allowlist, .env parsing without environ mutation) + 7 E2E (runtime_provider + MCP interpolation prove two profiles isolated, unscoped read raises, globals still read environ).	2026-06-19 07:34:15 -07:00
Ben Barclay	d82f9fa7f7	feat(gateway): multiplex phase 0 — config flag, profile enumeration, profile-stamped session keys Foundations for serving multiple profiles from one gateway process, inert when off: - gateway.multiplex_profiles config flag (default false), round-trips through GatewayConfig and load_gateway_config (top-level + nested gateway.* form). - hermes_cli.profiles.profiles_to_serve(multiplex): the single chokepoint for which (profile, HERMES_HOME) pairs the gateway serves. Lightweight dir scan; active-profile-only when off, default + all named profiles when on. - build_session_key gains a profile= namespace slot. Default/None reuse the historical 'agent:main:...' literal BYTE-IDENTICALLY (no session migration, positional parsers unaffected); a named profile becomes 'agent:<profile>:...' so two profiles on the same platform/chat never collide. - SessionStore._resolve_profile_for_key + _session_key_for_source fallback resolve the namespace from the flag (legacy when off, active profile when on). Tests: byte-identical-when-off (parametrized), namespace isolation, positional layout preserved, config round-trip, profiles_to_serve enumeration.	2026-06-19 07:34:15 -07:00
teknium1	df2420f571	fix(gateway): keep non-Discord home-channel startup send byte-identical The salvaged non_conversational marking made the home-channel startup no-metadata branch always pass metadata= explicitly; for non-Discord platforms _non_conversational_metadata returns None, so Telegram/etc. went from adapter.send(chat_id, message) to adapter.send(..., metadata=None). Behaviorally identical but broke test_restart_notification's exact assert_called_once_with. Only attach metadata when the marker applies (Discord), restoring the original call shape elsewhere.	2026-06-19 07:29:27 -07:00
snav	caaa916289	fix(gateway): don't let delayed Discord status messages partition history backfill Discord channel-history backfill partitions on Hermes' last self-authored message. Asynchronous, non-conversational status sends (self-improvement review bubbles, heartbeats, background-process notifications, update status, gateway restart/online notices) land as ordinary bot messages, so a delayed status bump becomes the history boundary and swallows real messages that arrived after Hermes' actual reply. Mark these sends at the source via metadata["non_conversational"] (Discord only; other platforms' metadata is unchanged). The adapter no longer advances the history-boundary cache for marked sends and persists their IDs to a sidecar JSON so the cold-start scan can skip them by ID after a restart. A narrow regex recognizer remains only as an upgrade bridge for status bumps emitted by an older gateway that pre-dates the marking.	2026-06-19 07:29:27 -07:00
Alex Yates	fad4b40d9d	fix(model): persist /model switch by default across sessions A plain /model <name> switch only lasted for the current session — every new session reverted to the previously-configured model, so users had to re-switch every time (e.g. glm-5.1 -> glm-5.2 on every launch). Persist-by-default is now the behavior across all three /model surfaces (CLI, gateway, TUI/dashboard), gated by a new config key model.persist_switch_by_default (default true): /model <name> switch model (persists to config.yaml) /model <name> --session switch for this session only /model <name> --global switch and persist (explicit, unchanged) The effective persistence is resolved once via resolve_persist_behavior() in hermes_cli/model_switch.py so --session opts out, --global opts in, and the config-gated default applies otherwise. --global remains a valid explicit no-op alias for the new default.	2026-06-19 07:07:06 -07:00
Charles Power	715fa9ea1c	fix(gateway): harden gateway command-line matcher (review findings) Address correctness gaps found in pre-PR review of the strict matcher: - Profile selectors can appear on EITHER side of the `gateway` token (`_apply_profile_override` strips `--profile`/`-p` from anywhere in argv before argparse), so `hermes gateway --profile work run` and `python -m hermes_cli.main gateway -p work run` are valid launches the previous matcher wrongly rejected. Strip `--profile`/`-p`/`--profile=`/`-p=` from anywhere before locating the subcommand. - A profile literally named `gateway` (`hermes -p gateway gateway run`) made the old token scan stop on the profile value; stripping the selector+value first fixes it. - Tokenize quote-aware with `shlex` so quoted Windows paths containing spaces (`"C:\Program Files\Hermes\hermes-gateway.exe"`) are no longer split mid-path and the dedicated-entrypoint match survives. Without these, the matcher could MISS a real running gateway -> the opposite failure (restart/status reporting "down" when up). Adds regression tests for all three shapes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 06:31:56 -07:00
Charles Power	fd92a3a5c9	fix(gateway): Windows restart no longer causes a silent outage `hermes gateway restart` on Windows could take the gateway offline with no replacement. restart() was stop() -> sleep(1.0) -> start(), but the graceful drain can run up to ~180s while the detached pythonw process stays alive. The 1s sleep let start() run against the still-draining old process; its "already running" guard then no-opped, and when the old process finally exited nothing relaunched it. Two root causes, both fixed: 1. Loose PID detection. `_scan_gateway_pids` and the gateway.status helpers used substring matches ("... gateway" in cmdline) for lifecycle decisions, so they false-matched `gateway status`/`dashboard` siblings and unrelated processes like `python -m tui_gateway`, plus stale gateway.pid records. Add a shared strict matcher `looks_like_gateway_command_line()` in gateway/status.py that requires the real `gateway run` subcommand (or the dedicated entrypoints), and route `_looks_like_gateway_process`, `_record_looks_like_gateway`, and `_scan_gateway_pids` through it. 2. restart() race. Wait until the gateway is authoritatively gone (`get_running_pid()` + strict `_gateway_pids()`) before relaunch; force-kill once if it lingers and raise rather than start a duplicate; verify the relaunch produced a running gateway and raise loudly if not (no more exit-0 silent outage). Scoped to Windows; systemd/launchd restart paths are already drain-aware. Adds tests/gateway/test_gateway_command_line_matcher.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 06:31:56 -07:00
infinitycrew39	ca92e9a362	fix(gateway): refresh cached agent max_iterations from current config When a gateway agent is reused from cache, it retains the max_iterations from its initial creation. If config.yaml agent.max_turns or HERMES_MAX_ITERATIONS changed between turns, the cached agent's budget becomes stale. Before reusing a cached agent, refresh agent.max_iterations from the freshly-resolved value (read from env/config at line 14585). Fixes partial issue from PR #48127: handles fresh agent creation + cached agent reuse.	2026-06-19 06:31:13 -07:00
infinitycrew39	460b1e50e5	fix(gateway): refresh max_turns before resolving runtime budget	2026-06-19 06:31:13 -07:00
Ben Barclay	a64fc490fe	fix(relay): make hosted gateways actually connect AND complete the inbound/outbound round-trip (#48828 ) * fix(relay): enable RELAY platform + normalize dial URL so hosted gateways actually connect Three bugs blocked a self-provisioned hosted gateway from ever establishing its inbound relay WS (found while standing up the live staging end-to-end). Each masked the next; all three are needed for inbound to work. 1. RELAY platform never enabled in config.platforms (gateway/config.py). register_relay_adapter() puts the adapter in the platform_registry, but start_gateway()'s connect loop iterates self.config.platforms — which never contained Platform.RELAY. So the adapter was "registered" but never connected (logs showed "relay adapter registered" then "No messaging platforms enabled"). Fix: _apply_env_overrides now enables Platform.RELAY (mirroring relay_url into extra for the connected-checker) when GATEWAY_RELAY_URL (env) or gateway.relay_url (yaml) is set. Absent -> no RELAY entry (direct/ single-tenant gateways unaffected). 2. URL scheme not converted for the WS dial (gateway/relay/ws_transport.py). The relay URL is configured once as the http(s):// base (used as-is for the provision POST), but websockets.connect rejects http(s):// with "scheme isn't ws or wss". Fix: _ws_dial_url converts https->wss / http->ws. 3. /relay path not appended (same helper). The connector mounts its WebSocketServer at path "/relay" and returns HTTP 400 on an upgrade to any other path. GATEWAY_RELAY_URL is the base (no /relay), so the dial hit "/" -> 400. Fix: _ws_dial_url ensures the path ends in /relay. Idempotent — a URL already carrying ws(s):// and/or /relay is unchanged, so provision's _provision_url (which derives /relay/provision from either form) still works. Why the cross-repo E2E missed #2/#3: the stub connector binds ws://host:port and its websockets.serve accepts ANY path, so neither the scheme nor the /relay path was exercised. Real connector needs both. Verified live on staging hermes-agent-stg-automated-perception-5054: after the fixes the gateway logs "Connecting to relay..." -> "✓ relay connected" -> "Gateway running with 1 platform(s)" against wss://gateway-gateway.staging-nousresearch.com/relay, stable. Tests: added _ws_dial_url scheme+path+idempotency cases (test_ws_transport.py) and RELAY-platform-enablement cases for env + yaml + absent (test_config.py). Full gateway/relay + config suites green (191 passed). Relay-adapter lane. EXPERIMENTAL. * fix(relay): re-attach guild_id to outbound so connector egress resolves the tenant The final bug in the hosted-relay round-trip. Inbound worked end to end (Discord -> connector -> bus -> agent WS -> agent runs -> reply), but the reply's egress was declined by the connector: "discord egress declined: target not routed to an onboarded tenant". Cause: the connector's routedEgressGuard resolves the owning tenant from the OUTBOUND action's metadata.guild_id (Discord's routing discriminator). The gateway's generic delivery path builds outbound metadata via run.py _thread_metadata_for_source, which only carries thread_id (and returns None entirely for a non-threaded message) — so guild_id never reached the connector, tenant resolution failed, and the shared bot refused to post. Fix (relay-adapter-local, no perturbation of the generic delivery path or other platforms): RelayAdapter learns chat_id -> guild_id from each inbound event (_capture_scope) and re-attaches it to the outbound action's metadata in send() (_with_scope) when not already present. No-op for chats we never saw inbound (e.g. DMs) and never overwrites an explicit guild_id. Verified live on staging hermes-agent-stg-automated-perception-5054: an @mention in #general now produces a visible bot reply — full multi-tenant relay round-trip (real Discord -> shared connector bot -> tenant routing -> agent WS -> reply egress -> Discord). Tests: _capture_scope/_with_scope reattach, no-scope no-op, explicit-guild_id preserved (test_relay_adapter.py). Full relay + config suites green (160 passed). Relay-adapter lane. EXPERIMENTAL.	2026-06-19 16:30:24 +10:00
Ben	637aff46e7	Merge remote-tracking branch 'origin/main' into hermes/hermes-6fe26723	2026-06-19 15:17:13 +10:00
Ben Barclay	2c6e266e88	fix(relay): trigger self-provision on relay-config + NAS token, not is_managed() (#48724 ) self_provision_if_managed() gated on is_managed(), but is_managed() means "NixOS/package-manager-managed" (it keys on HERMES_MANAGED or a ~/.hermes/.managed marker) — NOT "NAS-hosted". A NAS-provisioned Fly agent sets NEITHER, so the gate was always False and relay self-provision SILENTLY no-oped on exactly the hosted agents it was built for. Caught live: a staging agent with GATEWAY_RELAY_URL correctly stamped logged "No messaging platforms enabled" and never dialed the connector; HERMES_MANAGED was unset on the machine. The unit tests had mocked is_managed()->True, so they passed while the real trigger never fired (mocked- trigger blind spot). Fix: drop the is_managed() gate and rename self_provision_if_managed -> self_provision_relay. The real trigger is now "relay_url() set + no pinned secret + a resolvable NAS token", which is both NAS-independent and self-guarding: - NAS-hosted agent: GATEWAY_RELAY_URL + no pinned secret + bootstrapped NAS token -> self-provisions. - Self-hosted + `hermes gateway enroll`: pinned GATEWAY_RELAY_SECRET -> skipped (existing secret-present guard). - Self-hosted, unenrolled, no NAS identity: resolve_nous_access_token() fails -> graceful no-op (existing fail-soft path). Security: unchanged trust model. The connector still derives tenant from the validated NAS token; this only broadens WHEN the provision attempt fires, and every broadened case is still guarded by token-resolution + pinned-secret-skip. Tests: replaced the (wrong) "skips when not managed" test with a regression test proving a NAS host where is_managed()==False STILL provisions; renamed all call sites; added a "no NAS token -> non-fatal skip" test for the self-hosted branch. 88 relay tests pass. Relay-adapter lane. EXPERIMENTAL.	2026-06-19 01:01:24 +00:00
Ben Barclay	d2c53ff558	feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294 ) The connector now delivers inbound (messages + interrupts) over the gateway's OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The gateway needs no inbound HTTP port — which is what makes hosted gateways (no public IP) able to receive inbound at all. - gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler( self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into the existing per-session interrupt path (the inbound message handler was already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner lifecycle — there is no HTTP receiver anymore. - gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery receiver). - gateway/relay/__init__.py: removed relay_inbound_config() (dead with the receiver gone). The delivery key is still set in-process by self-provision for forward-compat but is no longer consumed for inbound. - docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path is documented as superseded. gatewayEndpoint noted as passthrough-plane only. Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt; new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an interrupt_inbound frame over the WS cancels the right session. Removed the HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key assertion. 88 relay tests pass. EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full multi-instance path against this production adapter code.	2026-06-19 09:33:15 +10:00
teknium1	0879d5cc8f	fix(gateway): preserve original transcript when /compress rotation is skipped The manual /compress handler called rewrite_transcript() unconditionally on the session id returned by _compress_context(). When rotation does not occur (e.g. _session_db unavailable, or the DB split raised), session_id is unchanged and rewrite_transcript() DELETEs the original messages and replaces them with only the compressed summary — permanent data loss (#44794, #39704). Guard the rewrite on actual rotation: only overwrite when _compress_context produced a new session id. Otherwise leave the original transcript intact and log a warning.	2026-06-18 13:38:35 -07:00

1 2 3 4 5 ...

2117 commits