Commit graph

2117 commits

Author SHA1 Message Date
Teknium
680732c104
fix(gateway): never interrupt a busy session with an internal completion event (#49738)
Async-delegation completions (delegate_task(background=true)) and
background-process completions (terminal notify_on_complete) re-enter the
originating session as internal MessageEvents. When the session was busy,
_handle_active_session_busy_message treated them like a user TEXT message and
the default busy_input_mode='interrupt' aborted the active turn (and sent a
'Interrupting current task' ack) — the opposite of the design invariant that a
completion surfaces as a new turn only when idle.

Short-circuit internal events to return False so the base adapter queues them
silently (it already excludes internal events from debounce), cascading them as
the next turn after the current one finishes.
2026-06-20 10:57:41 -07:00
kshitijk4poor
1fbf48d4ad fix(compression): make in-place compaction durable + rotation-independent end-to-end
Review (Codex + 3-agent parallel) found the first cut of in-place mode was
incomplete: it only updated the system prompt, so the persisted transcript
stayed 'full history + summary' and the next turn/resume reloaded the full
history and immediately re-compacted (a loop), and every downstream layer
that keyed off session-id rotation silently no-op'd. The session_id was
doing double duty as the 'compaction happened' signal. This wires the whole
path so removing rotation is actually complete:

Agent (agent/conversation_compression.py):
- In-place now DURABLY replaces the transcript: replace_messages(session_id,
  compressed) on the same row (the canonical store the gateway reloads from),
  not just update_system_prompt. Resume reloads the compacted set; no loop.
- Reset flush identity/cursor (_last_flushed_db_idx=0, _flushed_db_message_ids
  cleared) so next-turn appends diff against the compacted transcript.
- Expose a rotation-independent signal: agent._last_compaction_in_place, and
  in_place=True on the session:compress event.
- Fire the compaction-boundary hooks (context-engine on_session_start, memory
  manager on_session_switch, reason='compression') in BOTH modes — in-place
  passes the same id as parent so DAG/buffer state still checkpoints. Without
  this, memory/context plugins miss every in-place compaction.

Gateway auto-compress (gateway/run.py):
- Read agent._last_compaction_in_place; set history_offset=0 on rotation OR
  in-place (both return the compacted set, so slicing past the pre-compaction
  length would drop everything). Carry compacted_in_place in the result dict.
- No extra rewrite needed: the agent shares the gateway's SessionDB, so its
  replace_messages already updated the canonical store load_transcript reads.

Manual /compress (gateway/slash_commands.py):
- The throwaway /compress agent has no _session_db, so rewrite_transcript is
  the durable write. Previously gated behind 'if rotated:' which treated
  'id unchanged' as the #44794 data-loss failure case and SKIPPED the rewrite
  — making /compress a silent no-op in in-place mode. Now rewrites on rotated
  OR in_place; the data-loss guard still fires only for the genuine
  no-rotation-AND-not-in-place failure.

Hygiene auto-compress already writes _compressed to the same id
unconditionally (its agent has no _session_db, can't rotate) — correct for
in-place, no change.

Tests (tests/run_agent/test_in_place_compaction.py):
- Assert the DURABLE transcript IS the compacted set after reload
  (get_messages_as_conversation == compacted), message_count==2, flush
  identity reset, and the rotation-independent signal set on in-place /
  unset on rotation. Rotation regression guard unchanged.

Verified: 64 tests green across in-place + rotation/persistence/boundary/
concurrent/failure-sync/command/cli suites; E2E both modes (durable replace,
gateway offset=0, rotation preserves old transcript); ruff clean. Still
default-off.
2026-06-20 10:57:07 -07:00
Teknium
5600105478 refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins
Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging
adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite,
wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained
bundled plugins under plugins/platforms/<x>/, discovered via the platform
registry. Strips the per-platform core touchpoints from gateway/run.py,
gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and
tools/send_message_tool.py.

Carries forward the migration fixes (explicit enabled:false honored,
get_connected_platforms forces discovery, plugin is_connected via
gateway.get_env_value, logs --component gateway matches plugins.platforms.*,
matrix hidden on Windows).

Additionally ports config keys main added since the PR base: the matrix
plugin's _apply_yaml_config now also covers allowed_users,
ignore_user_patterns, process_notices, and session_scope (the inline
gateway/config.py matrix block gained these in the 1340 commits the PR sat
open; they would otherwise have been silently dropped on deletion).
2026-06-20 10:26:45 -07:00
kshitijk4poor
26d9a3c710 fix(signal): FIFO-evict the quote-detection timestamp cache
`_sent_message_timestamps` (the reply-to-own-message quote cache) used a
`set` evicted with `set.pop()`, which removes an ARBITRARY element — so once
more than the cap (500) outbound timestamps are tracked, a still-recent
timestamp could be dropped while older ones survive, missing a genuine
reply-to-own-message. Convert it to an OrderedDict with FIFO (oldest-first)
eviction, mirroring the recently-hardened echo ring (#31250). This closes the
same bug class on the sibling cache.

Adds a regression test asserting oldest-first eviction + MRU promotion.
2026-06-20 21:00:46 +05:30
w31rdm4ch1nZ
332f88f6a6 fix(signal): harden recently-sent echo ring with LRU + TTL 2026-06-20 20:50:52 +05:30
kshitijk4poor
32a97a20af fix(signal): strip self-mention in all groups, not just require_mention
Review follow-up on the salvaged self-mention strip (#31217): the original
only stripped the bot's rendered @<number>/@<uuid> self-mention inside the
`require_mention=true` branch, so groups with require_mention=false still
leaked it into the agent text. Hoist the strip to run for every group message
(fixing the whole bug class), and collapse the doubled space a mid-sentence
removal leaves while preserving intentional newlines.
2026-06-20 16:27:28 +05:30
Kailigithub
40b6ac9ac7 fix(signal): send explicit stop-typing RPC when cancelling indicator 2026-06-20 16:23:41 +05:30
Rick Ratmansky
96b10327b6 fix(signal): strip bot self-mention from group messages before agent dispatch 2026-06-20 16:23:41 +05:30
lkz-de
96db7c6883 fix(signal): preserve quoted reply context
Carry Signal quote metadata through gateway events so replies to assistant messages include the quoted context without personalizing comments.
2026-06-20 15:16:53 +05:30
kshitij
ff50a88617
Merge pull request #49558 from NousResearch/salvage/env-var-guards-48735 2026-06-20 15:11:54 +05:30
kshitijk4poor
a7dd98c860 fix(env): guard remaining malformed int/float env var casts with utils helpers
Widen the env_float() guard from #48735 across the whole bug class: a
non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd
port) raised an unhandled ValueError and crashed adapter/agent init.

Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to
the canonical utils.env_int / utils.env_float helpers (the established house
pattern), instead of duplicating per-module helpers or inline try/except:

- gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT
- gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL
- gateway/platforms/feishu.py: dedup cache + text/media batch settings
- gateway/platforms/wecom.py, discord/adapter.py: text batch delays
- gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT
- gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT
- hermes_cli/auth.py: CODEX/XAI refresh timeouts
- agent/chat_completion_helpers.py: API/stream read/stale timeouts
- run_agent.py, agent/auxiliary_client.py: API + nous timeouts

Sites already guarded by try/except or local helpers are left untouched.
The HERMES_MAX_ITERATIONS sites are already guarded on main via
_current_max_iterations(), so they are not included.
2026-06-20 14:54:36 +05:30
kshitijk4poor
abafba0762 refactor(signal): correct STT-fallback comment, type the markdown wrapper, make AAC test portable
Review follow-up on the salvaged AAC + markdown changes:
- Fix an inaccurate comment claiming the STT layer has a sniff-and-remux
  fallback (verified: no such fallback exists; the ffmpeg-absent path caches
  raw ADTS and STT may reject it).
- Type the _markdown_to_signal wrapper as tuple[str, list[str]] to match the
  shared helper instead of a bare tuple.
- Replace the hardcoded /home/pi/... test fixture with a runtime-generated
  ADTS AAC sample so the remux round-trip actually runs in CI (skips only
  when ffmpeg is absent) instead of always-skipping.
2026-06-20 14:24:29 +05:30
jasnoorgill
da34fca2bb fix(signal): detect ADTS AAC voice notes and remux to MP4
Android Signal delivers voice notes as raw ADTS AAC frames, which
share the `0xFF 0xFx` sync word with MPEG-1/2 Layer 3 (MP3). The
`_guess_extension` byte-signature test in gateway/platforms/signal.py
was matching both, so ADTS AAC was being misclassified as MP3 — saved
to disk with the wrong extension and rejected by every major STT API
(Groq, OpenAI) because their server-side format sniffers inspect the
actual codec, not the file extension.

Two changes:

1. Tighten the MP3 vs ADTS disambiguator. ADTS packs `ID`,
   `layer`, and `protection_absent` into bits 3-0 of byte 1, where
   `ID=0` and `layer=00` for AAC. Real MP3 has `ID=1` and
   `layer` in {01, 10, 11}. The mask `0xF6` against target `0xF0`
   cleanly separates them.

2. Remux raw ADTS AAC to MP4 container at the cache step via
   `ffmpeg -c:a copy`. Single demux/remux, no re-encode, no quality
   loss, sub-100ms on a Pi 5. The cached file is a normal `.m4a`
   that all major STT providers accept. ffmpeg is a transitive
   dependency of many other Hermes features (TTS, video skills) so
   this isn't a new install requirement; the remux degrades
   gracefully to a no-op if ffmpeg is missing.

The new helper `_remux_aac_to_m4a` is unit-tested with a real
Android voice note from the audio cache that originally triggered
the bug, plus synthetic ADTS frames for the byte-level
disambiguator and garbage-input graceful failure.

Closes the gap that broke transcription for any Android Signal user
sending voice messages to Hermes.
2026-06-20 13:48:05 +05:30
lkz-de
905820b59f fix(signal): share markdown formatting across send paths
Route Signal send paths through shared markdown formatting helpers and render markdown bullets consistently as Unicode bullets. Add coverage for Signal formatting and send_message integration.
2026-06-20 13:47:14 +05:30
helix4u
c253b07380 fix(model): clear stale endpoint credentials across switches 2026-06-19 19:58:26 -07:00
joaomarcos
75ed07ace8 fix(gateway): break the restart loop at the source on session resume
When a tool call itself restarts the gateway (docker restart, systemctl
restart, and similar), the process is terminated mid-call — before the
tool result is persisted and before the orderly drain rewind can run. The
transcript tail is left as an assistant(tool_calls) with no matching tool
answer. On resume the model re-issues the unanswered call, taking the
gateway down again — an infinite loop (#49201).

Source fix: _build_gateway_agent_history now strips a trailing
assistant(tool_calls) block that has no tool answers
(_strip_dangling_tool_call_tail), so there is nothing for the model to
re-execute. This complements _strip_interrupted_tool_tails, which only
handles the case where a tool result row exists with an interrupt marker.

Cognitive backstop: the resume-pending system note now states that any
restart command in the history already ran and must not be re-executed or
verified, and the empty-message auto-resume startup turn reports recovery
and asks for instructions instead of the nonsensical "address the user's
NEW message" (there is no new message on that turn).

Reimplements the intent of #49243 by @JoaoMarcos44 at the replay layer.

Fixes #49201
2026-06-19 16:59:58 -07:00
joaomarcos
3a6c171e9e fix(gateway): log signal transport response and bubble cron live adapter errors 2026-06-19 16:59:38 -07:00
joaomarcos
5649b8649a Fix silent delivery failures in Signal live adapter (#49260) 2026-06-19 16:59:38 -07:00
Gille
a7983d5ad7
fix(dashboard): hide sidecar sessions from history (#49269)
* fix(dashboard): hide sidecar sessions from history

* test(dashboard): allow sidecar source in session payload
2026-06-19 18:06:38 -04:00
Evo
2fe78d1ae3 fix(gateway): persist inline-keyboard model-picker selections by default
#49066 made /model text and the CLI picker persist to config.yaml by
default, but the gateway (Telegram/Discord/Matrix) inline-keyboard picker
callback stayed session-only. Mirror the text path's persist block so a
tapped model survives across launches like a typed one.
2026-06-20 02:32:44 +05:30
kshitijk4poor
d4e7dd609d refactor(windows): tidy managed-node resolver helpers
Behavior-preserving cleanups on the managed-node resolver:
- Hoist _candidate_node_command_names() out of the inner dir loop in
  find_hermes_node_executable (computed once, not per directory).
- Drop redundant os.environ.copy() at the two with_hermes_node_path(
  os.environ.copy()) sites \u2014 the helper already copies os.environ when
  called with no argument (verified env-equivalent).
- Add reciprocal keep-in-sync comments between iter_hermes_node_dirs()
  (hermes_constants.py) and hermesManagedNodePathEntries() (electron
  main.cjs), which mirror the same platform-ordering rule across the
  Python/Node boundary.
2026-06-20 02:12:16 +05:30
helix4u
7a7b56d498 fix(windows): prefer managed node for whatsapp and desktop 2026-06-20 02:00:37 +05:30
hakanpak
38f1a923af fix(gateway): rename the Telegram topic from /title, not only auto-titles
Auto-generated session titles already rename the Telegram forum topic via
the title_callback path, but the /title command only wrote the session
title to the database. On a Telegram topic lane the visible topic kept its
auto-assigned name, so a user who ran /title to override it saw no change.

Propagate the user-chosen title to the topic by calling the existing
_schedule_telegram_topic_title_rename helper on a successful /title set. It
already no-ops off Telegram topic lanes and when auto-rename is disabled.
2026-06-20 01:54:16 +05:30
alt-glitch
f3e967aae5 fix(mcp): round-3 polish — generation capture adjacency + gateway contract note
Third review pass (Hermes subagent) declared convergence: no BLOCKING, the
round-2 generation-aware publish / context-engine staging / CLI reload / ACP
routing all verified correct by hand and by test.

- agent_init: capture _tool_snapshot_generation immediately before the tool
  snapshot (was ~425 lines earlier); removes a harmless skew window so the
  recorded generation always matches the snapshot it describes.
- gateway/run.py _execute_mcp_reload: keep preserving each cached agent's
  build-time enabled_toolsets EXACTLY (do NOT merge newly-connected servers like
  CLI/TUI do) and document WHY — gateway sessions can be deliberately locked
  down, and test_reload_mcp_preserves_per_agent_toolset_overrides asserts this.
  A reviewer suggested "parity" here; it would have violated that contract.
2026-06-19 11:57:43 -07:00
alt-glitch
93d6e73028 fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway)
MCP servers that connect after the agent's one-time tool snapshot were
invisible for the whole session. Two root causes, fixed together:

1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers
   commonly take 2-6s on a cold connect, so they missed the window and
   their tools never entered the agent's snapshot. `thread.join(timeout)`
   already returns the instant discovery completes, so raising the bound
   costs ~0s for the common case (no MCP / fast servers) and only ever
   blocks for a genuinely-pending server, capped so a dead server can't
   freeze startup. The bound is now configurable via
   `mcp_discovery_timeout` (config.yaml, default 5.0s).

2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI
   `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh
   thread), and the late-refresh detected changes by tool COUNT — missing
   an equal-size add/remove swap. Consolidated into one shared
   `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by
   tool NAME, mutates the agent under a lock (thread-safe), and respects
   the agent's own enabled/disabled toolsets.

The late-binding refresh keeps its pre-first-turn cache-safety guard:
it never rebuilds the tool list once a turn has started, so the cached
prompt prefix is never invalidated mid-conversation.

Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the
name-based diff, in-place mutation, agent-scoped filtering, thread
safety, and the config-driven discovery bound (incl. instant-return
when nothing is pending). 75 passed across the touched areas.
2026-06-19 11:57:43 -07:00
Teknium
26e76a75e5
feat(telegram): opt-in Online/Offline bot status indicator (#49134)
Sets the Telegram bot's short description (the line under its name) to
"Online" on gateway connect and "Offline" on clean disconnect, gated
behind extra.status_indicator (off by default).

Telegram bots have no presence/online dot — that's a user-account
feature the Bot API doesn't expose for bots. The short description is
the closest available surface, so this gives users a way to tell whether
the gateway is up from the bot's profile.

- New extra.status_indicator flag (+ status_online/status_offline text
  overrides), read in __init__ via config.extra — no config-schema change.
- _set_status_indicator() helper: best-effort, swallows API errors so it
  never blocks connect/disconnect; truncates to Telegram's 120-char cap.
- Wired Online after _mark_connected(), Offline at top of disconnect()
  while the bot HTTP client is still alive.
- 9 unit tests + Telegram docs section.

Requested by @ilTrumpista, cc @Teknium.
2026-06-19 11:38:39 -07:00
Teknium
2a5e9d994a
Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos
feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)
2026-06-19 07:51:59 -07:00
Ben
1928aa0443 fix(managed-scope): honor managed scope in config→env bridges too
Manual verification surfaced a second bypass class beyond the standalone
config loaders: several code paths bridge config.yaml values into os.environ
(HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*,
network.force_ipv4, ...) by reading the raw user YAML, so the env the whole
process reads carried the USER's value even when an administrator pinned it —
e.g. a managed timezone was overridden because gateway/run.py wrote the user's
timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var
first.

Wired the shared apply_managed_overlay() into every config→env bridge:

- gateway/run.py module-level startup bridge (timezone, redact_secrets,
  max_turns, terminal, display, gateway.strict, ...)
- gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn
  re-bridge that keeps config authoritative over reloaded .env — must keep
  MANAGED authoritative on every turn, not just startup)
- hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge
  (runs before load_config is usable, at import time)
- hermes_cli/send_cmd.py top-level scalar config→env bridge

Verified end-to-end against a writable managed dir (12/12 checks incl. timezone,
logging, model, skin, gateway settings, write-guard) and in a clean process the
gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an
order-independent regression test for the bridge overlay.
2026-06-19 07:46:33 -07:00
Ben
b0e47a98f9 fix(managed-scope): honor managed scope in all standalone config loaders
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:

- gateway/config.py load_gateway_config (messaging gateway: session_reset,
  quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
  skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
  reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)

All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.

Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.

TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
2026-06-19 07:46:33 -07:00
teknium1
a58287afcb
Merge remote-tracking branch 'origin/main' into pr48275-rebase
# Conflicts:
#	cron/scheduler.py
2026-06-19 07:40:29 -07:00
Teknium
ba50e86563 fix: open dispatcher lock file with explicit utf-8 encoding
ruff (unspecified-encoding) and the Windows-footgun checker both flag
open() in text mode without encoture=. Keep text mode (the Windows lock
path in _try_acquire_file_lock writes a str newline) and pass
encoding='utf-8'.
2026-06-19 07:35:33 -07:00
Sahil Saghir
226e9322e1 fix(kanban): cross-platform dispatcher lock + explicit release
Two robustness gaps from community review (#44919):

1. Windows dead-path: replaced bespoke fcntl.flock with gateway.status
   _try_acquire_file_lock / _release_file_lock — already cross-platform
   (msvcrt on Windows, fcntl on POSIX). Added _release_singleton_lock
   helper.

2. Lock fd never released: stored handle is now released explicitly in
   both exit paths — CancelledError handler and normal while-loop exit.
   Allows in-process stop/restart (tests, embedded use).

Also tightened docstrings — 'corrupt the SQLite DBs' is now specific
(wal_autocheckpoint=0 + concurrent manual WAL checkpoints can corrupt
index pages), matching the module's own concurrency claims.
2026-06-19 07:35:33 -07:00
Sahil Saghir
dfa561092a fix(kanban): machine-global singleton lock for the embedded dispatcher (#41448)
The gateway's embedded dispatcher has no guard against more than one dispatcher
running concurrently. dispatch_in_gateway defaults to true, so a second gateway
for the same profile (a restart race where the old process is slow to exit) — or
any deployment that runs multiple profile gateways with the default — starts a
second dispatcher loop. As #41448 describes, concurrent dispatchers each run
release_stale_claims() against the same boards, double reclaim frequency, and
re-dispatch slow workers before they finish. In practice they also corrupt the
shared kanban SQLite DBs under concurrent write load.

Add _acquire_singleton_lock(): an exclusive, non-blocking fcntl.flock at the
machine-global kanban root (kanban_home()/kanban/.dispatcher.lock — the board is
shared across profiles by design, so this serialises every gateway, not just one
profile). The first gateway to start its dispatcher holds the lock for its
process lifetime; any other gateway finds it contended, logs, and skips
dispatching while still running for messaging. Falls back to config-only control
on non-POSIX or filesystems without flock.

This is more robust than a per-profile guard because the documented model is
"one dispatcher sweeps all boards" — the contention is across profiles, not just
within one. Closes #41448.

Test: lock is exclusive (held, then contended while held, then held again after
release).
2026-06-19 07:35:33 -07:00
Ben Barclay
1e70df5fdd feat(gateway): multiplex phase 4 — lifecycle guard + per-profile observability
- _guard_named_profile_under_multiplexer: when the default gateway is running
  with gateway.multiplex_profiles=on, a named-profile 'hermes gateway run' hard
  -errors (pointing at the multiplexer) instead of double-binding that
  profile's platforms. Inert unless all hold: this invocation is a named
  profile, a default-profile gateway is alive, and its config has multiplexing
  on. --force overrides. Wired into run_gateway's guard chain.
- write_runtime_status gains served_profiles: the secondary-adapter startup
  records [active] + multiplexed profiles into runtime_status.json so
  'hermes status' can show per-profile coverage without a second probe. Absent
  for single-profile gateways.

Tests: served_profiles round-trips and is absent by default; guard is inert for
the default profile / under --force / when no default gateway is running.
2026-06-19 07:34:15 -07:00
Ben Barclay
d5d02eabb0 feat(gateway): multiplex phase 3 — secondary-profile adapter registry + conflict detection
Bring up adapters for every profile the gateway serves, not just the active
one. Keeps self.adapters as the default/active profile's map (the ~93 existing
self.adapters[...] sites are untouched) and adds secondary profiles under
self._profile_adapters[profile][platform].

- _start_secondary_profile_adapters loops profiles_to_serve(multiplex=True),
  skips the active profile (handled by the primary startup loop), and for each
  other profile loads its gateway config and creates+connects its enabled
  adapters under that profile's _profile_runtime_scope (home + secret scope).
- Each secondary adapter gets _make_profile_message_handler(profile): stamps
  source.profile (when unset) before delegating to the shared _handle_message,
  so the agent turn and session key resolve to that profile.
- Same-platform credential-conflict detection: _adapter_credential_fingerprint
  hashes the adapter's bot token (salted, truncated — never logs the token);
  two profiles claiming the same (platform, token) refuse the duplicate with a
  clear error naming both, since one token can't be polled twice.
- Port-binding hard-error: a SECONDARY profile that enables a port-binding
  platform (webhook, api_server, msgraph_webhook, feishu, wecom_callback,
  bluebubbles, sms) is a config error and aborts startup via MultiplexConfigError
  — the default profile owns the single shared HTTP listener and serves every
  profile through the /p/<profile>/ prefix, so a second bind can only collide.
  Distinct from a transient connect failure (which logs + stays alive to retry):
  a config error writes gateway_state=startup_failed and exits cleanly with an
  actionable message (names the profile, the platform, and the fix). There is no
  valid reason to bind a second port once you've opted into a multiplexer.
- Shutdown tears down secondary adapters alongside the primary ones.
- Defensive getattr guards keep partial-construction unit tests (stop(),
  _run_agent on bare instances) working.

No-op when multiplex_profiles is off (self._profile_adapters stays empty).

Tests: fingerprint stability/log-safety/distinctness, profile message-handler
stamping (and not overriding an already-stamped source), port-binding hard-error
raises + names the profile/platform, non-binding platform is not rejected, and
the guard set covers every TCP-binding adapter.
2026-06-19 07:34:15 -07:00
Ben Barclay
f35abb122a feat(gateway): multiplex phase 1 — HTTP-inbound /p/<profile>/ routing (webhook)
Serve webhook inbound for multiple profiles off the one shared listener via a
URL prefix, with no second port bound.

- SessionSource gains a 'profile' field (round-trips through to_dict/from_dict;
  omitted when unset so existing serialization is unchanged). It carries which
  profile an inbound message was routed to.
- WebhookAdapter registers /p/{profile}/webhooks/{route_name} alongside the
  existing /webhooks/{route_name}. _resolve_request_profile validates the
  prefix against profiles_to_serve(): None when absent or multiplexing is off
  (ignored, handled as default — no spurious 404), the profile name when valid,
  _PROFILE_REJECTED (→ 404) when the profile isn't served. The resolved profile
  is stamped onto the SessionSource.
- session-key namespacing and the per-turn home/credential scope now prefer
  source.profile: SessionStore._resolve_profile_for_key(source),
  _session_key_for_source fallback, and _resolve_profile_home_for_source all
  honor it (→ the agent turn resolves that profile's config/skills/credentials
  via the Phase 2 _profile_runtime_scope).

Constraint: routing inbound needs no per-profile platform credential, but the
agent still needs the routed profile's provider key — delivered by Phase 2's
secret scope. api_server (OpenAI-compatible surface) profile routing is a
focused follow-on; its source-construction path differs from webhook's.

Tests: SessionSource.profile round-trip + namespace drive; _resolve_request_
profile accept/reject/ignore matrix.
2026-06-19 07:34:15 -07:00
Ben Barclay
f538470cf4 feat(gateway): multiplex phase 2 — fail-closed profile credential isolation (Workstream A)
The credential gate. When multiplexing is active, a profile's secrets resolve
from a context-local scope, never the process-global os.environ (which in a
multiplexer may hold another profile's keys, and is inherited by every
subprocess spawned with env=dict(os.environ)).

- agent/secret_scope.py: get_secret() backed by a secret-scope contextvar.
  FAIL-CLOSED: when multiplex is active and no scope is installed, an unscoped
  read RAISES UnscopedSecretError instead of falling back to os.environ — a
  missed/new call site crashes loudly at that line rather than leaking a
  cross-profile value. Genuinely-global vars (HERMES_*, PATH, kanban paths,
  …) keep reading os.environ via an allowlist. load_env_file/build_profile_
  secret_scope parse a profile .env into an isolated dict WITHOUT mutating
  os.environ. Off by default => transparent os.getenv behavior.
- hermes_cli/runtime_provider.py: all credential/provider/base-url reads go
  through _getenv -> get_secret.
- agent/credential_pool.py: env fallbacks route through get_secret (the
  ~/.hermes/.env-first preference is preserved and already profile-correct via
  the home override).
- tools/mcp_tool.py: MCP config  interpolation resolves through
  get_secret, so a server's  picks up the routed profile's value.
- gateway/run.py: set_multiplex_active() at GatewayRunner init; per-turn .env
  reload is a no-op for credentials in multiplex mode (secrets come from the
  scope, not global env); _profile_runtime_scope context manager combines the
  HERMES_HOME override + secret scope; _run_agent wraps _run_agent_inner in
  that scope (resolved via _resolve_profile_home_for_source) when multiplexing.

Propagates into the agent worker thread for free via the existing
copy_context() in _run_in_executor_with_context.

Tests: 13 unit (fail-closed, scope isolation, global allowlist, .env parsing
without environ mutation) + 7 E2E (runtime_provider + MCP interpolation prove
two profiles isolated, unscoped read raises, globals still read environ).
2026-06-19 07:34:15 -07:00
Ben Barclay
d82f9fa7f7 feat(gateway): multiplex phase 0 — config flag, profile enumeration, profile-stamped session keys
Foundations for serving multiple profiles from one gateway process, inert
when off:

- gateway.multiplex_profiles config flag (default false), round-trips through
  GatewayConfig and load_gateway_config (top-level + nested gateway.* form).
- hermes_cli.profiles.profiles_to_serve(multiplex): the single chokepoint for
  which (profile, HERMES_HOME) pairs the gateway serves. Lightweight dir scan;
  active-profile-only when off, default + all named profiles when on.
- build_session_key gains a profile= namespace slot. Default/None reuse the
  historical 'agent:main:...' literal BYTE-IDENTICALLY (no session migration,
  positional parsers unaffected); a named profile becomes 'agent:<profile>:...'
  so two profiles on the same platform/chat never collide.
- SessionStore._resolve_profile_for_key + _session_key_for_source fallback
  resolve the namespace from the flag (legacy when off, active profile when on).

Tests: byte-identical-when-off (parametrized), namespace isolation, positional
layout preserved, config round-trip, profiles_to_serve enumeration.
2026-06-19 07:34:15 -07:00
teknium1
df2420f571 fix(gateway): keep non-Discord home-channel startup send byte-identical
The salvaged non_conversational marking made the home-channel startup
no-metadata branch always pass metadata= explicitly; for non-Discord
platforms _non_conversational_metadata returns None, so Telegram/etc.
went from adapter.send(chat_id, message) to adapter.send(..., metadata=None).
Behaviorally identical but broke test_restart_notification's exact
assert_called_once_with. Only attach metadata when the marker applies
(Discord), restoring the original call shape elsewhere.
2026-06-19 07:29:27 -07:00
snav
caaa916289 fix(gateway): don't let delayed Discord status messages partition history backfill
Discord channel-history backfill partitions on Hermes' last self-authored
message. Asynchronous, non-conversational status sends (self-improvement
review bubbles, heartbeats, background-process notifications, update status,
gateway restart/online notices) land as ordinary bot messages, so a delayed
status bump becomes the history boundary and swallows real messages that
arrived after Hermes' actual reply.

Mark these sends at the source via metadata["non_conversational"] (Discord
only; other platforms' metadata is unchanged). The adapter no longer advances
the history-boundary cache for marked sends and persists their IDs to a
sidecar JSON so the cold-start scan can skip them by ID after a restart. A
narrow regex recognizer remains only as an upgrade bridge for status bumps
emitted by an older gateway that pre-dates the marking.
2026-06-19 07:29:27 -07:00
Alex Yates
fad4b40d9d fix(model): persist /model switch by default across sessions
A plain /model <name> switch only lasted for the current session — every
new session reverted to the previously-configured model, so users had to
re-switch every time (e.g. glm-5.1 -> glm-5.2 on every launch).

Persist-by-default is now the behavior across all three /model surfaces
(CLI, gateway, TUI/dashboard), gated by a new config key
model.persist_switch_by_default (default true):

  /model <name>             switch model (persists to config.yaml)
  /model <name> --session   switch for this session only
  /model <name> --global    switch and persist (explicit, unchanged)

The effective persistence is resolved once via resolve_persist_behavior()
in hermes_cli/model_switch.py so --session opts out, --global opts in,
and the config-gated default applies otherwise. --global remains a valid
explicit no-op alias for the new default.
2026-06-19 07:07:06 -07:00
Charles Power
715fa9ea1c fix(gateway): harden gateway command-line matcher (review findings)
Address correctness gaps found in pre-PR review of the strict matcher:

- Profile selectors can appear on EITHER side of the `gateway` token
  (`_apply_profile_override` strips `--profile`/`-p` from anywhere in argv
  before argparse), so `hermes gateway --profile work run` and
  `python -m hermes_cli.main gateway -p work run` are valid launches the
  previous matcher wrongly rejected. Strip `--profile`/`-p`/`--profile=`/`-p=`
  from anywhere before locating the subcommand.
- A profile literally named `gateway` (`hermes -p gateway gateway run`) made
  the old token scan stop on the profile value; stripping the selector+value
  first fixes it.
- Tokenize quote-aware with `shlex` so quoted Windows paths containing spaces
  (`"C:\Program Files\Hermes\hermes-gateway.exe"`) are no longer split mid-path
  and the dedicated-entrypoint match survives.

Without these, the matcher could MISS a real running gateway -> the opposite
failure (restart/status reporting "down" when up). Adds regression tests for
all three shapes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 06:31:56 -07:00
Charles Power
fd92a3a5c9 fix(gateway): Windows restart no longer causes a silent outage
`hermes gateway restart` on Windows could take the gateway offline with no
replacement. restart() was stop() -> sleep(1.0) -> start(), but the graceful
drain can run up to ~180s while the detached pythonw process stays alive. The
1s sleep let start() run against the still-draining old process; its
"already running" guard then no-opped, and when the old process finally exited
nothing relaunched it.

Two root causes, both fixed:

1. Loose PID detection. `_scan_gateway_pids` and the gateway.status helpers
   used substring matches ("... gateway" in cmdline) for lifecycle decisions,
   so they false-matched `gateway status`/`dashboard` siblings and unrelated
   processes like `python -m tui_gateway`, plus stale gateway.pid records.
   Add a shared strict matcher `looks_like_gateway_command_line()` in
   gateway/status.py that requires the real `gateway run` subcommand (or the
   dedicated entrypoints), and route `_looks_like_gateway_process`,
   `_record_looks_like_gateway`, and `_scan_gateway_pids` through it.

2. restart() race. Wait until the gateway is authoritatively gone
   (`get_running_pid()` + strict `_gateway_pids()`) before relaunch; force-kill
   once if it lingers and raise rather than start a duplicate; verify the
   relaunch produced a running gateway and raise loudly if not (no more
   exit-0 silent outage).

Scoped to Windows; systemd/launchd restart paths are already drain-aware.
Adds tests/gateway/test_gateway_command_line_matcher.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-19 06:31:56 -07:00
infinitycrew39
ca92e9a362 fix(gateway): refresh cached agent max_iterations from current config
When a gateway agent is reused from cache, it retains the max_iterations
from its initial creation. If config.yaml agent.max_turns or HERMES_MAX_ITERATIONS
changed between turns, the cached agent's budget becomes stale.

Before reusing a cached agent, refresh agent.max_iterations from the
freshly-resolved value (read from env/config at line 14585).

Fixes partial issue from PR #48127: handles fresh agent creation + cached agent reuse.
2026-06-19 06:31:13 -07:00
infinitycrew39
460b1e50e5 fix(gateway): refresh max_turns before resolving runtime budget 2026-06-19 06:31:13 -07:00
Ben Barclay
a64fc490fe
fix(relay): make hosted gateways actually connect AND complete the inbound/outbound round-trip (#48828)
* fix(relay): enable RELAY platform + normalize dial URL so hosted gateways actually connect

Three bugs blocked a self-provisioned hosted gateway from ever establishing its
inbound relay WS (found while standing up the live staging end-to-end). Each
masked the next; all three are needed for inbound to work.

1. RELAY platform never enabled in config.platforms (gateway/config.py).
   register_relay_adapter() puts the adapter in the platform_registry, but
   start_gateway()'s connect loop iterates self.config.platforms — which never
   contained Platform.RELAY. So the adapter was "registered" but never connected
   (logs showed "relay adapter registered" then "No messaging platforms
   enabled"). Fix: _apply_env_overrides now enables Platform.RELAY (mirroring
   relay_url into extra for the connected-checker) when GATEWAY_RELAY_URL (env)
   or gateway.relay_url (yaml) is set. Absent -> no RELAY entry (direct/
   single-tenant gateways unaffected).

2. URL scheme not converted for the WS dial (gateway/relay/ws_transport.py).
   The relay URL is configured once as the http(s):// base (used as-is for the
   provision POST), but websockets.connect rejects http(s):// with "scheme isn't
   ws or wss". Fix: _ws_dial_url converts https->wss / http->ws.

3. /relay path not appended (same helper). The connector mounts its
   WebSocketServer at path "/relay" and returns HTTP 400 on an upgrade to any
   other path. GATEWAY_RELAY_URL is the base (no /relay), so the dial hit "/"
   -> 400. Fix: _ws_dial_url ensures the path ends in /relay. Idempotent — a URL
   already carrying ws(s):// and/or /relay is unchanged, so provision's
   _provision_url (which derives /relay/provision from either form) still works.

Why the cross-repo E2E missed #2/#3: the stub connector binds ws://host:port and
its websockets.serve accepts ANY path, so neither the scheme nor the /relay path
was exercised. Real connector needs both.

Verified live on staging hermes-agent-stg-automated-perception-5054: after the
fixes the gateway logs "Connecting to relay..." -> "✓ relay connected" ->
"Gateway running with 1 platform(s)" against
wss://gateway-gateway.staging-nousresearch.com/relay, stable.

Tests: added _ws_dial_url scheme+path+idempotency cases (test_ws_transport.py)
and RELAY-platform-enablement cases for env + yaml + absent (test_config.py).
Full gateway/relay + config suites green (191 passed).

Relay-adapter lane. EXPERIMENTAL.

* fix(relay): re-attach guild_id to outbound so connector egress resolves the tenant

The final bug in the hosted-relay round-trip. Inbound worked end to end (Discord
-> connector -> bus -> agent WS -> agent runs -> reply), but the reply's egress
was declined by the connector: "discord egress declined: target not routed to an
onboarded tenant".

Cause: the connector's routedEgressGuard resolves the owning tenant from the
OUTBOUND action's metadata.guild_id (Discord's routing discriminator). The
gateway's generic delivery path builds outbound metadata via
run.py _thread_metadata_for_source, which only carries thread_id (and returns
None entirely for a non-threaded message) — so guild_id never reached the
connector, tenant resolution failed, and the shared bot refused to post.

Fix (relay-adapter-local, no perturbation of the generic delivery path or other
platforms): RelayAdapter learns chat_id -> guild_id from each inbound event
(_capture_scope) and re-attaches it to the outbound action's metadata in send()
(_with_scope) when not already present. No-op for chats we never saw inbound
(e.g. DMs) and never overwrites an explicit guild_id.

Verified live on staging hermes-agent-stg-automated-perception-5054: an
@mention in #general now produces a visible bot reply — full multi-tenant relay
round-trip (real Discord -> shared connector bot -> tenant routing -> agent WS ->
reply egress -> Discord).

Tests: _capture_scope/_with_scope reattach, no-scope no-op, explicit-guild_id
preserved (test_relay_adapter.py). Full relay + config suites green (160 passed).

Relay-adapter lane. EXPERIMENTAL.
2026-06-19 16:30:24 +10:00
Ben
637aff46e7 Merge remote-tracking branch 'origin/main' into hermes/hermes-6fe26723 2026-06-19 15:17:13 +10:00
Ben Barclay
2c6e266e88
fix(relay): trigger self-provision on relay-config + NAS token, not is_managed() (#48724)
self_provision_if_managed() gated on is_managed(), but is_managed() means
"NixOS/package-manager-managed" (it keys on HERMES_MANAGED or a ~/.hermes/.managed
marker) — NOT "NAS-hosted". A NAS-provisioned Fly agent sets NEITHER, so the gate
was always False and relay self-provision SILENTLY no-oped on exactly the hosted
agents it was built for. Caught live: a staging agent with GATEWAY_RELAY_URL
correctly stamped logged "No messaging platforms enabled" and never dialed the
connector; HERMES_MANAGED was unset on the machine. The unit tests had mocked
is_managed()->True, so they passed while the real trigger never fired (mocked-
trigger blind spot).

Fix: drop the is_managed() gate and rename self_provision_if_managed ->
self_provision_relay. The real trigger is now "relay_url() set + no pinned secret
+ a resolvable NAS token", which is both NAS-independent and self-guarding:
  - NAS-hosted agent: GATEWAY_RELAY_URL + no pinned secret + bootstrapped NAS
    token -> self-provisions.
  - Self-hosted + `hermes gateway enroll`: pinned GATEWAY_RELAY_SECRET -> skipped
    (existing secret-present guard).
  - Self-hosted, unenrolled, no NAS identity: resolve_nous_access_token() fails
    -> graceful no-op (existing fail-soft path).

Security: unchanged trust model. The connector still derives tenant from the
validated NAS token; this only broadens WHEN the provision attempt fires, and
every broadened case is still guarded by token-resolution + pinned-secret-skip.

Tests: replaced the (wrong) "skips when not managed" test with a regression test
proving a NAS host where is_managed()==False STILL provisions; renamed all call
sites; added a "no NAS token -> non-fatal skip" test for the self-hosted branch.
88 relay tests pass.

Relay-adapter lane. EXPERIMENTAL.
2026-06-19 01:01:24 +00:00
Ben Barclay
d2c53ff558
feat(relay): WS-only inbound on the gateway adapter (Phase 3) (#48294)
The connector now delivers inbound (messages + interrupts) over the gateway's
OUTBOUND /relay WebSocket, not a signed HTTP POST to an inbound endpoint. The
gateway needs no inbound HTTP port — which is what makes hosted gateways (no
public IP) able to receive inbound at all.

- gateway/relay/adapter.py: connect() wires set_interrupt_inbound_handler(
  self.on_interrupt) so connector->gateway interrupt_inbound frames bridge into
  the existing per-session interrupt path (the inbound message handler was
  already wired). Removed _maybe_start_inbound_receiver() + the _inbound_runner
  lifecycle — there is no HTTP receiver anymore.
- gateway/relay/inbound_receiver.py: deleted (the signed-HTTP InboundDelivery
  receiver).
- gateway/relay/__init__.py: removed relay_inbound_config() (dead with the
  receiver gone). The delivery key is still set in-process by self-provision for
  forward-compat but is no longer consumed for inbound.
- docs/relay-connector-contract.md: §3 rewritten — inbound is the WS back-channel
  routed cross-instance via the connector's relay bus; §5 interrupt + §6 auth
  table updated; the old signed-HTTP-POST + per-tenant-delivery-key-signing path
  is documented as superseded. gatewayEndpoint noted as passthrough-plane only.

Tests: stub_connector grows set_interrupt_inbound_handler + push_interrupt;
new test_relay_interrupt case proves connect() wires BOTH inbound handlers and an
interrupt_inbound frame over the WS cancels the right session. Removed the
HTTP-receiver test; updated the crypto-shedding scan + self-provision delivery-key
assertion. 88 relay tests pass.

EXPERIMENTAL. Pairs with gateway-gateway (relay bus + WsGatewayDelivery) and the
NAS GATEWAY_RELAY_URL stamp. The cross-repo E2E (connector repo) proves the full
multi-instance path against this production adapter code.
2026-06-19 09:33:15 +10:00
teknium1
0879d5cc8f fix(gateway): preserve original transcript when /compress rotation is skipped
The manual /compress handler called rewrite_transcript() unconditionally on
the session id returned by _compress_context(). When rotation does not occur
(e.g. _session_db unavailable, or the DB split raised), session_id is unchanged
and rewrite_transcript() DELETEs the original messages and replaces them with
only the compressed summary — permanent data loss (#44794, #39704).

Guard the rewrite on actual rotation: only overwrite when _compress_context
produced a new session id. Otherwise leave the original transcript intact and
log a warning.
2026-06-18 13:38:35 -07:00