After a prolonged outage the in-process network-error ladder escalates to
fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter
that reconnects through the bootstrap path. That path called
start_polling(drop_pending_updates=True), discarding every update Telegram
queued during the outage — all messages sent while the bot was down were
silently lost. The in-process ladder and 409-conflict handler already passed
drop_pending_updates=False; only bootstrap did not distinguish a cold first
boot from a reconnect.
Thread an is_reconnect signal from the watcher through
_connect_adapter_with_timeout into adapter.connect(). The base
BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every
adapter inherits a tolerant signature (no per-platform breakage when the
runner forwards the kwarg). Telegram translates is_reconnect into
drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap
calls. Cold boot still drops the stale queue; a watcher reconnect preserves it.
Fixes#46621.
Co-authored-by: annguyenNous <annguyen@nousresearch.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>
A readable state.db can still reject every message write through the
messages_fts* triggers when the FTS5 index is corrupt: base-table reads and
PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database
disk image is malformed'. The gateway reloads conversation_history from disk
each turn, so a silently-failed write hands the next turn stale/empty history
even though the same cached AIAgent still holds the live transcript — causing
immediate same-session amnesia. (#50502)
- hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write
through the FTS triggers, so write-only corruption (which the read-only
probe reported healthy) is detected. repair_state_db_schema() gains an
in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus
an already_healthy short-circuit. Both 'hermes sessions repair' and
'hermes doctor' route through these, so the fix covers the whole class.
- hermes_cli/doctor.py: the state.db check runs the write-health probe even on
the success (readable) path and repairs in place with --fix.
- gateway/run.py: _select_cached_agent_history() prefers the cached agent's
longer live _session_messages over a shorter persisted transcript, so an
FTS write failure can't wipe in-session context.
- tests: regressions for write-health detection, in-place repair preserving
rows + resuming writes, the already_healthy shortcut, and the gateway guard.
Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author),
#52165 (@davidgut1982), and #50576 (@trevorgordon981).
The scale-to-zero idle watcher never started on a correctly-opted-in,
relay-only instance, so the gateway never ran its idle decision, never called
go_dormant(), and never sent going_idle to the connector. Fly's autostop still
suspended the machine on traffic-idle, but the connector never flipped the
instance to buffered-only — so an inbound DM took the live delivery path,
found no live session for the suspended machine, and was dropped fail-closed
with no wake poke. The machine slept and never woke.
Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys())
to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a
DISABLED placeholder PlatformConfig for every known platform (telegram,
discord, slack, matrix, …), so the key set is always the full ~20-entry
catalog regardless of what the instance actually runs. The relay-only check
discarded "relay", saw the disabled placeholders as live direct-socket
platforms, and returned False — so should_arm() was False and the watcher was
never created. Verified live on a staging instance: config.platforms keys =
[telegram, discord, slack, mattermost, matrix, relay] with only relay
enabled=True; should_arm() = False.
Fix: filter config.platforms to ENABLED entries before the relay-only check,
mirroring the adapter-connect loop which already gates on
`if not platform_config.enabled: continue`. This arms off the same notion of
"active platform" the rest of start() already uses — no parallel concept.
Also add a one-line not-armed diagnostic: when an instance IS opted in (the
HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why
(relay_only_or_absent, the enabled platforms, wake_url present/missing). A
non-opted instance stays silent. The arm path previously logged only on
success, so a failed arm was invisible.
Tests: the existing pure-helper tests passed bare names so they never
exercised the call site that feeds the placeholder-laden config. Add
behaviour-contract tests against the REAL _scale_to_zero_should_arm with a
realistic config.platforms (relay enabled + others disabled). The F25
regression test (relay-only + disabled placeholders must arm) and the
no-platform case are RED without this fix, GREEN with it; the
genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay
correctly non-arming so the filter can't over-broaden.
Wake mechanism itself verified healthy independently (direct wakeUrl GET
resumed a suspended staging instance in 1.15s, clean resume signature).
The #45966 cross-process coherence guard popped the stale cached agent
and then called the blocking _cleanup_agent_resources (memory-provider
shutdown, tool-resource teardown, async-client teardown) while still
holding _agent_cache_lock, on the gateway event-loop thread. While that
ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher)
blocked acquiring the same lock and the asyncio loop stalled for minutes,
tripping repeated Discord 'heartbeat blocked' warnings.
Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry
under the lock, release it, then schedule the SOFT release on a daemon
thread. The soft path (_release_evicted_agent_soft) is also more correct
here than the hard teardown the regression used — the same session
rebuilds a fresh agent immediately after invalidation, so its terminal
sandbox / browser / bg processes (keyed on task_id) must be preserved
for the rebuilt agent to inherit, not torn down.
Verified the cross-process site was the only cleanup-under-lock instance;
the other _cleanup_agent_resources call sites run outside the lock.
* feat(moa): expose MoA presets as selectable virtual models
Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.
Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.
- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
/ one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.
* fix(moa): thread moa_config through _run_agent to _run_agent_inner
The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).
* fix(moa): classify moa as a virtual provider in the catalog
The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
- test_provider_catalog: api_key providers must expose a credential env var
- test_provider_parity: every hermes-model provider must be desktop-configurable
moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
Salvage of #50098 by @srojk34, cherry-picked onto current main.
The hygiene auto-compress guard and the /compress slash command both read
compression_in_place (config flag — is in-place mode enabled?) instead of
_last_compaction_in_place (result flag — did in-place compaction actually
succeed?). Both agents are built without a session_db, so archive_and_compact
always fails silently and _last_compaction_in_place stays False. Reading the
config flag makes the guard think in-place succeeded, triggering
rewrite_transcript() which replaces the original messages with only the
compressed summary — permanent data loss.
Co-authored-by: srojk34 <srojk34@users.noreply.github.com>
- Replace getattr(self.session_store, '_db', None) with self._session_db
(the GatewayRunner's own SessionDB, consistent with existing usage in
slash_commands.py L240/L499).
- Remove verbose comment referencing a branch name as an issue number.
- Update stale comment in run.py that said 'today it has no session_db'.
- Add regression test verifying session_db is passed and rotated session
is persisted (adapted from #51624 by @LeonSGP43).
- Add _session_db=None to _make_runner fixtures in test_compress_command,
test_compress_focus, and test_compress_plugin_engine.
Manual /compress and session hygiene auto-compress both create temporary
AIAgent instances to run compression. These agents were created without
a session_db, so compress_context computed the compressed messages in
memory, rotated the session ID, and reported success — but never wrote
to the database. The next user message reloaded the original full
transcript, making compression appear to do nothing.
Fix: pass session_db=self.session_store._db to both temp agents so the
session rotation is properly persisted. Also set _end_session_on_close
on the /compress temp agent (already done in hygiene path) to prevent
cleanup from ending the newly rotated session.
When the gateway persists a user message after a transient provider
failure (429/timeout/auth error), subsequent retries of the same
Telegram message could stack duplicate user turns in the transcript,
causing the agent to fall behind by 1-2 messages.
Add has_platform_message_id() to SessionDB (using the existing
idx_messages_platform_msg_id partial index) and a SessionStore wrapper.
The gateway's transient-failure path checks this before
append_to_transcript -- if the platform_message_id is already
persisted, the duplicate write is skipped.
Salvaged from #47869 by @davidgut1982. Adapted to current main which
has additional append sites and an existing content-based dedupe in
the exception handler path.
Closes#47237
Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake.\n\nAdd regression coverage for both review findings.
The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero
primitives (gateway-gateway Phase 5): the gateway decides it is idle and
drives the relay transport dormant so the platform (Fly autostop:"suspend")
can suspend the now-traffic-idle machine, which wakes on the connector's
wakeUrl poke (decisions.md Q3=C', D1-D13).
- gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS
Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds
(config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2),
messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)),
is_idle (D2/D3/F7).
- gateway/run.py: _last_inbound_at clock stamped on user inbound in
_handle_message (F13); the arm-gate + idle predicate + the
_scale_to_zero_watcher dormant sequence (mark draining -> adapter
go_dormant() -> cooldown), started only when armed. Deliberately NOT the
stop path and NOT mark_resume_pending (F12/D13).
- tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7).
- hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5.
Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED).
Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in
the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests),
confirmed via a pristine baseline worktree.
Add compression.minimum_context_floor config key that allows users
to lower the compression threshold floor below the hardcoded 64K
default, preventing infinite tool-call loops on models whose
structured output degrades well before 64K tokens.
- agent/model_metadata.py: add get_configurable_minimum_context()
helper with 16K hard safety limit
- agent/context_compressor.py: accept minimum_context_floor param,
thread it through _compute_threshold_tokens
- agent/conversation_compression.py: use compressor's floor for
aux model context validation
- agent/agent_init.py: read compression.minimum_context_floor from
config and pass to ContextCompressor
- gateway/run.py: cache-busting includes new key
Salvaged from #31686 by @Tranquil-Flow onto current main.
Resolves conflicts with in-place compaction (#38763) and max_tokens
threshold computation (#43547) that landed after the original PR.
Closes#31600
After /stop, the next user message can hit a stale generation token and
return with api_calls=0, no failure, no interruption. _normalize_empty_agent_response
fell through to an empty string, so the gateway logged "response=0 chars"
and sent nothing — the message was silently lost while internal work
sometimes continued.
Add the api_calls==0 / not-failed / not-interrupted / not-partial branch
to the single normalization chokepoint so the user gets a short retry hint
instead of silence. Regression test asserts the hint surfaces.
Salvaged from #33851 (re-applied on current main; original was 1401 commits
behind and the function had moved).
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).
Consume the flag at both sites:
1. gateway/run.py — capture `was_auto_reset` into a local and set the
attribute False immediately at the top of the cleanup block, so the
cleanup can't re-fire on a later message and wipe an override stored
between turns. Downstream reads use the captured local.
2. gateway/slash_commands.py — the model path consumes the flag before
storing the override, so a /model-first-after-auto-reset isn't wiped by
the next message's cleanup.
Salvaged from #48062 by x7peeps (authorship preserved).
Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.
Fixes#48031
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.
Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.
Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).
Fixes#40695
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.
Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:
- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
received); on a 4401 close after a prior success, latch `auth_revoked` and do
NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader
(prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
handler (so the adapter is removed and NOT queued for reconnection — the
credential is dead until the instance is recreated). Monitor is cancelled on
disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
`disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
a PowerOff icon, and muted (not destructive-red) text + message. New optional
`status.disabled` i18n string ("Disabled").
Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.
Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.
Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
The contributor PR stamped runner._exit_code=78 on non-retryable startup
errors, but start_gateway()'s clean-exit branch returned True before the
SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish
script's [ "$1" = "78" ] check never matched and s6 crash-looped the
gateway anyway — the fix was dead as shipped (#51228).
Honor runner.exit_code in the clean-exit branch: raise SystemExit(code)
when set, else return True (normal /restart clean exit). Add a
start_gateway()-level test that asserts process-level SystemExit(78)
propagation — the gap the PR's object-level test missed — plus exit_code
on the existing _CleanExitRunner mocks.
Profiles without their own messaging token inherit the default
profile's token via os.getenv, hit a token collision, and exit with
startup_failed. s6 restarts them immediately, creating ~30MB tirith
sandbox dirs in /tmp each cycle — filling the disk in hours (#51228).
Changes:
- gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78
- gateway/run.py: set exit_code=78 on non-retryable startup errors
(token collision, no platforms)
- hermes_cli/service_manager.py: add _render_finish_script() that
translates exit 78 → exit 125 (s6 permanent failure)
- hermes_cli/container_boot.py: write finish script alongside run
script during profile registration
The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish.
Closes#51228
Open-ended skill learning across every surface. /learn <free text> takes a
description of any source — a directory, a URL, the workflow you just walked
the agent through, or pasted notes — and the live agent gathers it with the
tools it already has (read_file/search_files, web_extract, the conversation,
the pasted text), then authors a SKILL.md via skill_manage following the
house authoring standards (<=60-char description, the standard section order,
Hermes-tool framing, no invented commands).
No engine, no model-tool footprint, works on any terminal backend (local,
Docker, remote): /learn builds a standards-guided prompt and hands it to the
agent as a normal turn.
- agent/learn_prompt.py: shared standards-guided prompt builder
- /learn registry entry (both surfaces) + CLI handler (inject onto input
queue) + gateway handler (rewrite turn, fall through, /blueprint pattern)
- tui_gateway command.dispatch returns a send directive -> TUI + dashboard chat
- dashboard Skills page 'Learn a skill' panel (dir + URL + open-ended text)
composes a /learn request and runs it in chat
- docs (slash-commands ref + skills feature page), 11 targeted tests
Inspired by OpenAI Codex's Record & Replay and the /learn concept from #47234
(dir-distillation engine); reworked to be open-ended and engine-free per
review.
Adds a per-platform display.reasoning_style setting (code | blockquote |
subtext) controlling how the show_reasoning summary renders on the gateway.
Discord defaults to "subtext" (-# small grey metadata text); every other
platform keeps the fenced code block. Resolves through the existing
display.platforms.<platform>.reasoning_style override chain.
The gateway half of Phase 6 Unit ζ: project the agent's existing relevance
knobs into the connector's platform-agnostic vocabulary and declare them at boot
over the /relay/policy route, so the SAME mention-gating / free-response /
allow-bots behavior the agent applies directly also governs relay delivery (and
excluded chatter never wakes a scaled-to-zero agent).
- gateway/relay/__init__.py:
- relay_relevance_policy(): project require_mention -> requireAddress,
free_response_channels -> freeResponseScopes, {PLATFORM}_ALLOW_BOTS in
{mentions,all} -> allowOtherBots. Reads the fronted platform's config block
+ bridged top-level keys. Returns None when all-default (the connector's
quiet default already matches) or no concrete platform is fronted.
- send_relay_policy(): POST /relay/policy authenticated with the gateway's own
per-gateway upgrade token (make_upgrade_token — same bearer as the WS
upgrade), so the connector attaches it to the authenticated instance, never
a body-asserted id. Re-declares every boot (self-healing, full replace).
NEVER raises, NEVER blocks boot — relevance is an optimization layered on
the δ/ε authorization gate. Reuses the per-gateway secret + the
/relay/provision host; no new inbound surface, no new credential.
- _policy_url(): ws(s)://…/relay -> http(s)://…/relay/policy.
- gateway/run.py: call send_relay_policy() after register_relay_adapter()
succeeds (the secret is resolved by then).
- docs/relay-connector-contract.md: new §7 documenting per-instance delivery +
the management plane (/manage/* + /relay/policy) + the relevance-declaration
contract; versioning renumbered to §8. Contract conformance test stays green
(§2/§3 tables untouched).
Tests: +12 (projection mapping incl. comma-string + top-level fallback; send
auth/skip/fail-soft/non-200). Full relay suite 118 pass. The connector route is
already E2E-proven (connector repo gateway_policy_driver.py); this adds the real
gateway send-path it pairs with.
This completes Phase 6 (Team Gateway per-user isolation) end to end.
* feat(goals): add /goal wait <pid> barrier to park the loop on a background process
The /goal loop re-pokes the agent every turn via the post-turn judge. When a
goal is gated on a long-running background process (CI poller, build, test
matrix, deploy) that produces nothing to judge yet, this spins the agent into
'is it done?' busy-work and burns the turn budget.
/goal wait <pid> [reason] parks the loop: while the PID is alive, the judge is
skipped, no turn is consumed, no continuation fires, and /goal status shows a
parked indicator. The barrier auto-clears the moment the process exits (the
agent's notify_on_complete watcher is the natural wake signal), then the next
turn resumes normal judging. /goal unwait clears it manually; pause/resume/clear
drop it; a dead/stale PID can never wedge the loop.
Wired across CLI, gateway, and the mid-run command guard for parity. Barrier
persists in SessionDB.state_meta (survives /resume); GoalState gains
backward-compatible waiting_on_pid/waiting_reason/waiting_since fields. 12 new
tests; docs updated.
* fix(goals): use gateway.status._pid_exists for liveness, not os.kill(pid,0)
The Windows-footguns CI guard flagged os.kill(pid, 0) in _pid_alive — on
Windows that's not a no-op, it routes to CTRL_C_EVENT and hard-kills the
target's console process group (bpo-14484). Delegate to the canonical
footgun-safe gateway.status._pid_exists (psutil + ctypes/POSIX fallback)
instead, with a direct-psutil last resort.
* feat(goals): judge-driven auto-wait — the loop parks itself, no manual /goal wait
Makes the wait barrier automatic. Every turn the judge is shown the agent's
live background processes (pid, command, uptime, output tail from the
process_registry) alongside the goal + response, and can return a new 'wait'
verdict instead of continue:
{"verdict":"wait","wait_on_pid":N} → park until that process exits
{"verdict":"wait","wait_for_seconds":N} → park until the deadline passes
evaluate_after_turn acts on the directive (sets the barrier, parks the loop)
so the agent isn't re-poked into busy-work while CI/builds/deploys run. Adds a
time-based waiting_until barrier alongside the pid barrier; both auto-clear and
can never wedge the loop. Drivers (CLI, gateway, tui_gateway) feed the live
registry in via gather_background_processes(). Manual /goal wait stays as an
override. Judge verdict contract widened to (verdict, reason, parse_failed,
wait_directive); legacy {"done":bool} shape still accepted.
* test(goals): update kanban _fake_judge to the 4-tuple judge contract
CI test(3) caught it: test_kanban_goal_mode's _fake_judge still returned the
3-tuple (verdict, reason, parse_failed), but the kanban loop now unpacks the
4-tuple (+ wait_directive). Update the fake to return None for the directive
and accept the background_processes kwarg.
* feat(goals): trigger-based wait — park on a process's own signal, not just exit
Addresses two gaps in the judge-driven wait: (1) the judge could only express
'wait until PID exits' or 'wait N seconds', so a long-lived watcher/server that
fires a trigger MID-RUN (and may never exit) couldn't be waited on; (2) the
process's own watch_patterns/notify_on_complete trigger was invisible to the judge.
Adds a session-based barrier (waiting_on_session) that releases on the process's
OWN trigger via process_registry.is_session_waiting(): the session exits, OR (if
started with watch_patterns) its pattern matches — even while the process keeps
running. list_sessions() now surfaces session_id + watch_patterns/watch_hit/
notify_on_complete so the judge sees the trigger and is told to prefer
wait_on_session for trigger processes. Judge verdict gains a {wait_on_session}
directive (preferred over pid). Backward-compatible GoalState field; pid + time
barriers unchanged.
Tests: TestSessionTriggerBarrier (release on mid-run pattern match while alive,
release on exit, unknown-session, full park→trigger→resume, parse, validation,
backcompat load). 105 goal-surface + 85 process_registry tests green.
Tirith redacts its own findings, but the approval-request callbacks built the
operator prompt from the RAW command string, so a credential-shaped value
Tirith flagged was sent verbatim to clients, undoing the redaction one layer up.
Two egress transports carried the leak; both are fixed via a shared
module-level seam _redact_approval_command() (redact_sensitive_text force=True):
1. chat platforms — _approval_notify_sync (gateway/run.py): redact before
both the button path (send_exec_approval) and the plain-text /approve
fallback.
2. SSE/API stream — _approval_notify (gateway/platforms/api_server.py):
redact event['command'] before it is enqueued to API/desktop clients.
(whole-bug-class: sibling call path on a separate transport.)
force=True so the prompt — a hard secret-egress boundary — honors redaction
even when security.redact_secrets is off. Clean commands pass through unchanged.
Tests bind the seam (synthetic credential-format fixtures, force-when-disabled) AND assert
BOTH callbacks ASSIGN the redacted result before the send/enqueue sink, via an
AST contract that rejects a discarded-result call. All mutation-checked.
Authorization to message the agent is the gate, not the file extension.
Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was
opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass
at all on Telegram/Slack — so an .html (or any non-allowlisted type) was
dropped or hard-rejected before the agent saw it.
Now every authorized upload is cached and surfaced to the agent regardless
of type:
- base.cache_media_bytes(): unknown types cache as octet-stream (or the
caller-supplied MIME) instead of returning None — fixes the chokepoint
that Teams/Telegram-media route through.
- discord/telegram/slack adapters: removed the allowlist reject/skip; any
non-media attachment is typed DOCUMENT and cached. Known types keep their
precise MIME.
- Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text +
code + config + markup) instead of a blind UTF-8 decode, so binary formats
(PDF/zip/docx) with ASCII headers are never inlined.
- gateway/run.py emits the path-pointing context note for every DOCUMENT,
including non text/application MIME types.
- discord.allow_any_attachment is now a documented no-op kept for config
back-compat.
Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types
cache, known types stay precise, PDFs are not inlined.
terminal.docker_extra_args passes flags verbatim to `docker run` (e.g.
--gpus=all, --shm-size=16g). It was wired into DEFAULT_CONFIG,
TERMINAL_CONFIG_ENV_MAP (so `hermes config set` bridged it),
terminal_tool._get_env_config (reads TERMINAL_DOCKER_EXTRA_ARGS), and
DockerEnvironment (applies extra_args) -- but it was MISSING from cli.py's
env_mappings and gateway/run.py's _terminal_env_map.
Consequence: a user who hand-edits config.yaml (rather than running
`hermes config set`) has docker_extra_args silently dropped on the CLI and
gateway/desktop startup paths, while docker_image / docker_volumes (which
ARE in those maps) bridge correctly -- producing the reported 'Hermes
partially reads the Docker config' symptom where --gpus=all and
--shm-size=16g never reach docker run.
This is the same bridge-coverage bug class that shipped before for
docker_run_as_host_user (cli + gateway) and docker_mount_cwd_to_workspace
(gateway). Fix by adding the key to both maps, plus a dedicated regression
pin in test_terminal_config_env_sync.py mirroring the existing
test_docker_*_is_bridged_everywhere guards.
Surface dangerous host/deployment posture at gateway startup so operators get
the 'you're exposed' signal the June 2026 MCP-config persistence campaign
victims never had. Warn-only — never blocks startup, never raises.
Checks (each independently fail-safe):
- Running as root (POSIX uid 0)
- SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default)
- Running in a container with no persistent volume mount over HERMES_HOME
- Network-accessible API server with no API_SERVER_KEY
New module hermes_cli/security_audit_startup.py; invoked once per process from
start_gateway() right after setup_logging(). Cross-platform (root/SSH checks
no-op on Windows). Idea: @Cthulhu.
In Telegram streaming, the typing indicator persisted through the slow
final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble
lingered for seconds after the last streamed token. Add a one-shot
on_before_finalize hook to GatewayStreamConsumer, fired once when the
stream transitions into its finalization path, and wire it on both
Telegram streaming call sites to call pause_typing_for_chat() before
the final edit. Cover hook ordering and once-only behavior in tests.
Fixes#49712
* fix(api-server): stop silently promising async delivery on stateless HTTP path
terminal(notify_on_complete=True / watch_patterns) and delegate_task(background=True)
silently no-op'd on the API server / WebUI path (#10760): the watcher / detached
child registered, but every API-server route (OpenAI-spec /v1/chat/completions
and /v1/responses, plus the proprietary /v1/runs SSE stream) tears down its
channel when the turn ends, and APIServerAdapter.send() is a no-op stub. A
completion that fires after the response closed had nowhere to go — from the
agent side, indistinguishable from a hang.
There is no spec-compliant surface to wake the agent later on a stateless HTTP
client, so make the no-op honest instead of silent:
- Add a per-adapter capability flag supports_async_delivery (default True;
APIServerAdapter = False), propagated into a HERMES_SESSION_ASYNC_DELIVERY
contextvar via async_delivery_supported(). Toggle on the adapter, not a
hardcoded platform string — a future stateless adapter is correct-by-default.
- terminal: when delivery is unsupported, skip watcher registration, force
notify_on_complete off, and return a notify_unsupported note telling the
agent to process(action='poll').
- delegate_task: when delivery is unsupported, fall back to SYNCHRONOUS
execution (work runs and returns in the same response) with a note, instead
of handing out a handle that never resolves.
CLI (in-process completion_queue) and the real gateway platforms are unchanged.
Fixes#10760
* refactor(api-server): route session binding through a single no-delivery chokepoint
Add APIServerAdapter._bind_api_server_session() and route both agent-entry
paths (_run_agent for /v1/chat/completions + /v1/responses, and the /v1/runs
_run_sync path) through it. The helper hardwires platform="api_server" and
async_delivery=False with no async_delivery parameter to pass, so a future
route added to the API server physically cannot reintroduce the silent
no-op (#10760) by forgetting to mark the channel as non-delivering.
The binding stays request-scoped (cleared per turn), so a session resumed
later on a delivering interface (CLI / gateway platform) re-binds fresh and
is NOT blocked — the no-delivery decision tracks the interface handling the
current turn, never the session.
A turn forcibly interrupted by the drain-timeout escalation never reaches
turn_finalizer.finalize_turn (the only place that flushes the turn to
state.db). Its in-flight tool rounds live only in the in-memory
_session_messages, so the immediate pre-restart turn was silently dropped
from load_transcript() on resume.
_finalize_shutdown_agents now flushes _session_messages to the SQLite
session store before teardown. The flush is idempotent (identity-tracked
in _flush_messages_to_session_db), so agents that finished gracefully
re-flush nothing. The resume_pending / fresh-tool-tail branches in
_handle_message_with_agent already expect a transcript whose tail may be a
pending tool result.
Fixes#13121.
Funnel session finalization through AIAgent.close() — the single terminal
path every agent (CLI, gateway, subagent, cron) funnels through — so finished
agents stop leaving rows with ended_at IS NULL. The biggest leak source was
delegate_task subagent + background-review forks whose close() never ended
their row.
end_session() is first-reason-wins and no-ops on an already-ended row, so a
'compression'/'cron_complete'/'cli_close' reason set by an earlier terminal
path is never clobbered. /resume already calls reopen_session(), so
finalizing-on-close does not break resumability.
Temporary helper agents that rotate/share the session forward (manual
compression, gateway session-hygiene) opt out via _end_session_on_close=False.
Also stop the long-running gateway heartbeat once the executor is done or the
session slot is rebound to a different agent, preventing a stale
'running: delegate_task' bubble from outliving its run.
Closes#12029.
The gateway pre-compression hygiene valve force-compressed any session
crossing 400 messages regardless of token usage. On large-context (1M+)
models doing many short, message-dense turns, a healthy session at ~16%
token usage could hit 400 messages and get force-compressed — and the
compression summary's stale Active Task could then bleed into the next
turn.
The valve's actual purpose is to break a death spiral: when API calls
keep disconnecting on an oversized session, no token-usage data arrives,
the token threshold never fires, and the transcript grows unbounded.
It's a count-based floor for that pathological case only. 400 was tuned
for ~200K-context models and is far too low for modern large-context
sessions. Raise the default to 5000 — still well clear of any death
spiral, but no longer firing on legitimate long conversations.
The value remains fully configurable via compression.hygiene_hard_message_limit.
The gateway only rewrote gateway_state.json on lifecycle transitions
(start/connect/drain/stop), never on turn start/end. Live-verified on a
hosted agent: a confirmed end-to-end turn ran while gateway_updated_at
stayed frozen at boot and active_agents was absent — so any active_agents
read from the file between transitions is stale. That makes it unusable
as a busy/idle signal for an external consumer (NAS deciding whether it's
safe to restart/migrate/auto-update an agent mid-turn).
Add _persist_active_agents(), called at every turn boundary:
- turn start: both running-agent sentinel-claim sites (normal inbound
message path + startup-resume path)
- turn end: the central _release_running_agent_state() choke point
(covers normal completion, /stop, /reset, sentinel cleanup,
stale-eviction — every path that ends a running turn)
It passes ONLY active_agents to write_runtime_status, leaving
gateway_state (and every other field) _UNSET so the read-merge-write
preserves the current lifecycle state. Passing gateway_state=None would
clobber it — hence a dedicated helper rather than reusing
_update_runtime_status. The write is the same cheap JSON write done on
lifecycle transitions today; best-effort (a failed status write never
disrupts a turn).
Behaviour-contract test: an active_agents-only write preserves both
running and draining gateway_state, and the count clamps non-negative.
After context compression, the agent re-sent an already-delivered
generated image on every subsequent turn (#46627). The auto-append
fallback rescans full history when the message list shrinks (compression-
safe path), deduping against _history_media_paths — but that set was built
by scanning ONLY MEDIA: text tags in tool results. image_generate returns
its path in a JSON payload field (host_image/image/agent_visible_image),
never a MEDIA: tag, so generated-image paths never entered the dedup set
and were re-emitted after the boundary.
Extract the history-path collection into _collect_history_media_paths(),
which now covers BOTH delivery shapes: MEDIA: text tags AND image_generate
JSON-payload paths (mirroring what _collect_auto_append_media_tags
extracts). The inline block in _handle_message is replaced with a call to
the helper.
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
Gateway Session Hygiene auto-compression destroyed the original transcript
when the throwaway hygiene agent couldn't rotate the session (#21301, P1).
The _hyg_agent is built WITHOUT a session_db, so _compress_context cannot
end-and-fork the session (its rotate block is gated on agent._session_db).
The session_id stays unchanged, and the rewrite_transcript() call ran
UNCONDITIONALLY — replacing the full original transcript with just the
head+summary list. Permanent data loss on every hygiene compaction.
Guard the rewrite behind 'rotated OR in-place' exactly like the /compress
path already does (#44794/#39704): only overwrite when a new session id
was minted or in-place compaction succeeded; otherwise preserve the
original transcript and log a warning. The token/count bookkeeping that
followed the rewrite is moved inside the guard, with no-change values in
the preserve branch.
Co-authored-by: SandroHub013 <sandrohub013@gmail.com>
Co-authored-by: WuTianyi123 <wtyopenclaw@gmail.com>
Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
When the agent is busy and the user sends multiple text follow-ups, the
interrupt-mode and steer-fallback path stored them via
merge_pending_message_event(merge_text=True), which newline-joins
consecutive TEXT messages into a SINGLE pending turn — collapsing two
separate user messages into one mashed-together turn and destroying the
message boundaries the user sees (#43066 sub-bug 2).
Route that storage through _queue_or_replace_pending_event (the same FIFO
infrastructure used by busy queue-mode and /queue) so each follow-up gets
its own next-turn slot in arrival order, while still preserving
photo-burst / album merge semantics for media. Pure queue-mode already
used FIFO; this brings the interrupt/steer-fallback path in line.
The sibling defect in #43066 (assistant messages lost after compaction)
was already fixed on main by the identity-tracking flush rewrite (#46053)
plus the pre-rotation flush (#47202), so this only addresses the
remaining busy-message-merge half.
Co-authored-by: KiruyaMomochi <65301509+KiruyaMomochi@users.noreply.github.com>
Async-delegation completions (delegate_task(background=true)) and
background-process completions (terminal notify_on_complete) re-enter the
originating session as internal MessageEvents. When the session was busy,
_handle_active_session_busy_message treated them like a user TEXT message and
the default busy_input_mode='interrupt' aborted the active turn (and sent a
'Interrupting current task' ack) — the opposite of the design invariant that a
completion surfaces as a new turn only when idle.
Short-circuit internal events to return False so the base adapter queues them
silently (it already excludes internal events from debounce), cascading them as
the next turn after the current one finishes.
Review (Codex + 3-agent parallel) found the first cut of in-place mode was
incomplete: it only updated the system prompt, so the persisted transcript
stayed 'full history + summary' and the next turn/resume reloaded the full
history and immediately re-compacted (a loop), and every downstream layer
that keyed off session-id rotation silently no-op'd. The session_id was
doing double duty as the 'compaction happened' signal. This wires the whole
path so removing rotation is actually complete:
Agent (agent/conversation_compression.py):
- In-place now DURABLY replaces the transcript: replace_messages(session_id,
compressed) on the same row (the canonical store the gateway reloads from),
not just update_system_prompt. Resume reloads the compacted set; no loop.
- Reset flush identity/cursor (_last_flushed_db_idx=0, _flushed_db_message_ids
cleared) so next-turn appends diff against the compacted transcript.
- Expose a rotation-independent signal: agent._last_compaction_in_place, and
in_place=True on the session:compress event.
- Fire the compaction-boundary hooks (context-engine on_session_start, memory
manager on_session_switch, reason='compression') in BOTH modes — in-place
passes the same id as parent so DAG/buffer state still checkpoints. Without
this, memory/context plugins miss every in-place compaction.
Gateway auto-compress (gateway/run.py):
- Read agent._last_compaction_in_place; set history_offset=0 on rotation OR
in-place (both return the compacted set, so slicing past the pre-compaction
length would drop everything). Carry compacted_in_place in the result dict.
- No extra rewrite needed: the agent shares the gateway's SessionDB, so its
replace_messages already updated the canonical store load_transcript reads.
Manual /compress (gateway/slash_commands.py):
- The throwaway /compress agent has no _session_db, so rewrite_transcript is
the durable write. Previously gated behind 'if rotated:' which treated
'id unchanged' as the #44794 data-loss failure case and SKIPPED the rewrite
— making /compress a silent no-op in in-place mode. Now rewrites on rotated
OR in_place; the data-loss guard still fires only for the genuine
no-rotation-AND-not-in-place failure.
Hygiene auto-compress already writes _compressed to the same id
unconditionally (its agent has no _session_db, can't rotate) — correct for
in-place, no change.
Tests (tests/run_agent/test_in_place_compaction.py):
- Assert the DURABLE transcript IS the compacted set after reload
(get_messages_as_conversation == compacted), message_count==2, flush
identity reset, and the rotation-independent signal set on in-place /
unset on rotation. Rotation regression guard unchanged.
Verified: 64 tests green across in-place + rotation/persistence/boundary/
concurrent/failure-sync/command/cli suites; E2E both modes (durable replace,
gateway offset=0, rotation preserves old transcript); ruff clean. Still
default-off.
Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging
adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite,
wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained
bundled plugins under plugins/platforms/<x>/, discovered via the platform
registry. Strips the per-platform core touchpoints from gateway/run.py,
gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and
tools/send_message_tool.py.
Carries forward the migration fixes (explicit enabled:false honored,
get_connected_platforms forces discovery, plugin is_connected via
gateway.get_env_value, logs --component gateway matches plugins.platforms.*,
matrix hidden on Windows).
Additionally ports config keys main added since the PR base: the matrix
plugin's _apply_yaml_config now also covers allowed_users,
ignore_user_patterns, process_notices, and session_scope (the inline
gateway/config.py matrix block gained these in the 1340 commits the PR sat
open; they would otherwise have been silently dropped on deletion).
When a tool call itself restarts the gateway (docker restart, systemctl
restart, and similar), the process is terminated mid-call — before the
tool result is persisted and before the orderly drain rewind can run. The
transcript tail is left as an assistant(tool_calls) with no matching tool
answer. On resume the model re-issues the unanswered call, taking the
gateway down again — an infinite loop (#49201).
Source fix: _build_gateway_agent_history now strips a trailing
assistant(tool_calls) block that has no tool answers
(_strip_dangling_tool_call_tail), so there is nothing for the model to
re-execute. This complements _strip_interrupted_tool_tails, which only
handles the case where a tool result row exists with an interrupt marker.
Cognitive backstop: the resume-pending system note now states that any
restart command in the history already ran and must not be re-executed or
verified, and the empty-message auto-resume startup turn reports recovery
and asks for instructions instead of the nonsensical "address the user's
NEW message" (there is no new message on that turn).
Reimplements the intent of #49243 by @JoaoMarcos44 at the replay layer.
Fixes#49201
Third review pass (Hermes subagent) declared convergence: no BLOCKING, the
round-2 generation-aware publish / context-engine staging / CLI reload / ACP
routing all verified correct by hand and by test.
- agent_init: capture _tool_snapshot_generation immediately before the tool
snapshot (was ~425 lines earlier); removes a harmless skew window so the
recorded generation always matches the snapshot it describes.
- gateway/run.py _execute_mcp_reload: keep preserving each cached agent's
build-time enabled_toolsets EXACTLY (do NOT merge newly-connected servers like
CLI/TUI do) and document WHY — gateway sessions can be deliberately locked
down, and test_reload_mcp_preserves_per_agent_toolset_overrides asserts this.
A reviewer suggested "parity" here; it would have violated that contract.
MCP servers that connect after the agent's one-time tool snapshot were
invisible for the whole session. Two root causes, fixed together:
1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers
commonly take 2-6s on a cold connect, so they missed the window and
their tools never entered the agent's snapshot. `thread.join(timeout)`
already returns the instant discovery completes, so raising the bound
costs ~0s for the common case (no MCP / fast servers) and only ever
blocks for a genuinely-pending server, capped so a dead server can't
freeze startup. The bound is now configurable via
`mcp_discovery_timeout` (config.yaml, default 5.0s).
2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI
`reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh
thread), and the late-refresh detected changes by tool COUNT — missing
an equal-size add/remove swap. Consolidated into one shared
`tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by
tool NAME, mutates the agent under a lock (thread-safe), and respects
the agent's own enabled/disabled toolsets.
The late-binding refresh keeps its pre-first-turn cache-safety guard:
it never rebuilds the tool list once a turn has started, so the cached
prompt prefix is never invalidated mid-conversation.
Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the
name-based diff, in-place mutation, agent-scoped filtering, thread
safety, and the config-driven discovery bound (incl. instant-return
when nothing is pending). 75 passed across the touched areas.