Commit graph

2208 commits

Author SHA1 Message Date
Teknium
811df74a10
fix(gateway): defer cross-process cache cleanup off the cache lock (#52197) (#52761)
The #45966 cross-process coherence guard popped the stale cached agent
and then called the blocking _cleanup_agent_resources (memory-provider
shutdown, tool-resource teardown, async-client teardown) while still
holding _agent_cache_lock, on the gateway event-loop thread. While that
ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher)
blocked acquiring the same lock and the asyncio loop stalled for minutes,
tripping repeated Discord 'heartbeat blocked' warnings.

Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry
under the lock, release it, then schedule the SOFT release on a daemon
thread. The soft path (_release_evicted_agent_soft) is also more correct
here than the hard teardown the regression used — the same session
rebuilds a fresh agent immediately after invalidation, so its terminal
sandbox / browser / bg processes (keyed on task_id) must be preserved
for the rebuilt agent to inherit, not torn down.

Verified the cross-process site was the only cleanup-under-lock instance;
the other _cleanup_agent_resources call sites run outside the lock.
2026-06-25 18:58:47 -07:00
Teknium
fd2a35b169
fix: stop reporting cache-hit rate and cost across all UI surfaces (#52717)
* fix: stop reporting cache-hit rate and cost across all UI surfaces

Cost estimates and cache read/write token reporting are unreliable on
providers that don't surface cached_tokens (e.g. ollama-cloud, which doesn't
implement prompt_tokens_details.cached_tokens), producing misleading
near-zero 'cache hit' readouts and cost figures. Remove cost + cache-hit
reporting from every user-facing surface; keep input/output/total token
counts (provider-agnostic and accurate) and the Nous account billing UI
(real account money, separate from per-conversation estimates).

Surfaces:
- CLI /usage + model-info: drop cost lines + cache read/write token lines
- Gateway /usage + /model: drop cost + cache lines
- tui_gateway/server.py: stop emitting cost_usd / cache_read in usage and
  subagent.complete payloads
- TUI (Ink): drop cost from status bar (+ showCost plumbing), /usage panel,
  thinking rollup, agents overlay (incl. compare view); keep token counts
- Desktop Command Center: drop cost stat, per-model cost, actual-cost hint

Underlying estimate_usage_cost / format_cost / insights cost columns are
left intact but no longer surfaced (display-only change, reversible).

* test: update TUI + gateway + CLI tests for removed cost/cache-hit reporting

- CLI /usage test asserts cost/cache lines are absent, tokens present
- gateway /usage test drops cost + cache asserts; removes cost-included test
- TUI subagentTree summary expectation drops the cost segment
- useConfigSync + appChrome status-rule tests drop showCost prop/state
2026-06-25 15:21:22 -07:00
Gille
e7d2f0b93c fix(windows): suppress console flashes and harden gateway restarts 2026-06-25 14:42:38 -07:00
Teknium
c6575df927
feat(moa): expose MoA presets as selectable virtual models (#46081)
* feat(moa): expose MoA presets as selectable virtual models

Reconstructed onto current main (PR #46081's base had diverged with no common
ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual
provider: each named preset is a selectable model under provider 'moa', and the
preset's aggregator is the acting model that answers and calls tools.

Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same
batch pattern delegate_task uses) — all references dispatched at once, collected
when every one finishes, then handed to the aggregator. Output order is
preserved, failures and the MoA-recursion guard stay isolated per reference.

- Removed the old mixture_of_agents model tool and moa toolset.
- Added moa as a virtual provider in the provider/model inventory.
- /moa is shortcut behavior over model selection (default preset / named preset
  / one-shot prompt).
- Dashboard + Desktop manage named presets; presets appear in model pickers.
- Parallel reference fan-out in agent/moa_loop.py with regression test.

* fix(moa): thread moa_config through _run_agent to _run_agent_inner

The reconstructed gateway MoA wiring declared moa_config on _run_agent (the
profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper
never forwarded it — _run_agent_inner had no such parameter, so the runtime hit
NameError: name 'moa_config' is not defined on the compression-failure session
sync path. Add moa_config to _run_agent_inner's signature and forward it from
both wrapper call sites (multiplex and non-multiplex). Caught by
tests/gateway/test_compression_failure_session_sync.py on CI shard test(4).

* fix(moa): classify moa as a virtual provider in the catalog

The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so
provider_catalog() fell through to the default auth_type="api_key" with no
env vars — tripping two catalog invariants:
  - test_provider_catalog: api_key providers must expose a credential env var
  - test_provider_parity: every hermes-model provider must be desktop-configurable

moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that
overlay as an auth_type fallback so the catalog reports moa as virtual (no real
credential, no network endpoint). Exempt virtual providers from the desktop
parity union check the same way 'custom' is exempt — derived from the catalog,
not a hardcoded slug, so future virtual providers are covered too.
2026-06-25 13:52:06 -07:00
srojk34
510bf40705 fix(gateway): read compaction result flag not config flag in hygiene guard (#50098)
Salvage of #50098 by @srojk34, cherry-picked onto current main.

The hygiene auto-compress guard and the /compress slash command both read
compression_in_place (config flag — is in-place mode enabled?) instead of
_last_compaction_in_place (result flag — did in-place compaction actually
succeed?). Both agents are built without a session_db, so archive_and_compact
always fails silently and _last_compaction_in_place stays False. Reading the
config flag makes the guard think in-place succeeded, triggering
rewrite_transcript() which replaces the original messages with only the
compressed summary — permanent data loss.

Co-authored-by: srojk34 <srojk34@users.noreply.github.com>
2026-06-25 12:56:05 -07:00
kshitijk4poor
73c8d5a1e7 fix: use self._session_db directly + add regression test
- Replace getattr(self.session_store, '_db', None) with self._session_db
  (the GatewayRunner's own SessionDB, consistent with existing usage in
  slash_commands.py L240/L499).
- Remove verbose comment referencing a branch name as an issue number.
- Update stale comment in run.py that said 'today it has no session_db'.
- Add regression test verifying session_db is passed and rotated session
  is persisted (adapted from #51624 by @LeonSGP43).
- Add _session_db=None to _make_runner fixtures in test_compress_command,
  test_compress_focus, and test_compress_plugin_engine.
2026-06-26 00:50:40 +05:30
Omar B
1a38a8ff7d fix(gateway): pass session_db to compress temp agents so persistence works
Manual /compress and session hygiene auto-compress both create temporary
AIAgent instances to run compression. These agents were created without
a session_db, so compress_context computed the compressed messages in
memory, rotated the session ID, and reported success — but never wrote
to the database. The next user message reloaded the original full
transcript, making compression appear to do nothing.

Fix: pass session_db=self.session_store._db to both temp agents so the
session rotation is properly persisted. Also set _end_session_on_close
on the /compress temp agent (already done in hygiene path) to prevent
cleanup from ending the newly rotated session.
2026-06-26 00:50:40 +05:30
kshitij
5de8a8fbe8
Merge pull request #52375 from NousResearch/salvage/47237-dedupe-user-turns
fix(gateway): dedupe user turns on transient failure (#47237)
2026-06-26 00:30:59 +05:30
davidgut1982
6208d6b3be fix(gateway): dedupe user turns on transient failure (#47237)
When the gateway persists a user message after a transient provider
failure (429/timeout/auth error), subsequent retries of the same
Telegram message could stack duplicate user turns in the transcript,
causing the agent to fall behind by 1-2 messages.

Add has_platform_message_id() to SessionDB (using the existing
idx_messages_platform_msg_id partial index) and a SessionStore wrapper.
The gateway's transient-failure path checks this before
append_to_transcript -- if the platform_message_id is already
persisted, the duplicate write is skipped.

Salvaged from #47869 by @davidgut1982. Adapted to current main which
has additional append sites and an existing content-based dedupe in
the exception handler path.

Closes #47237
2026-06-26 00:11:17 +05:30
Ben Barclay
d6269da7fd
fix(gateway): harden scale-to-zero dormancy guards (#52359)
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake.\n\nAdd regression coverage for both review findings.
2026-06-25 20:41:03 +10:00
Ben Barclay
72ae163250
fix(relay): authorize relay-delivered events by delivery, not source.platform (#52306)
* fix(relay): authorize relay-delivered events by delivery, not source.platform

The #52190 upstream-authz fix keyed _is_user_authorized off
source.platform via _adapter_authorization_is_upstream(source.platform).
But a relay *message* inbound carries the UNDERLYING platform
(source.platform == discord/telegram/...), NOT Platform.RELAY, because
ws_transport._event_from_wire maps the connector's wire payload
(platform="discord") straight onto SessionSource for session-keying and
egress. The relay adapter is registered only under Platform.RELAY, so
adapters.get(Platform.DISCORD) misses, the trusted-upstream branch is
skipped, and the user hits the env-allowlist default-deny:

    WARNING gateway.run: Unauthorized user: <id> (<name>) on discord

(Live staging bug: alpha tester linked successfully, then every
follow-up DM was silently dropped.)

Fix: the authentic trust signal is that the event was delivered over the
per-instance-authenticated relay WS, not which platform it underlies. Add
a wire-INVISIBLE SessionSource.delivered_via_upstream_relay flag, stamped
by the relay transport in _event_from_wire, and authorize on it. The flag
is excluded from to_dict/from_dict so a peer can neither forge it across
the wire nor have it restored from persistence. The existing adapter-flag
check is retained for events whose source.platform IS Platform.RELAY
(interaction-passthrough). A direct Discord event on a multiplexing
gateway (direct + relay adapters) is unmarked and still default-denies.

* fix(relay): use identity check on delivery marker to avoid MagicMock fail-open

A MagicMock() source (used by test_signal.py and other gateway tests) auto-
vivifies source.delivered_via_upstream_relay as a truthy Mock, which a bare
truthiness check would treat as authorized — flipping
test_signal_in_allowlist_maps from False to True. The marker is a real bool on
SessionSource, so check 'is True' explicitly: refuses to authorize any non-bool
stand-in, defensive against accidental fail-open.
2026-06-25 14:21:09 +10:00
Victor Kyriazakos
b177d4ee48 fix(cron): mirror continuable cron as a labelled user turn (alternation-safe)
Addresses review on #51077 (kxee). The continuable-cron mirror reused
gateway.mirror.mirror_to_session, which writes role=assistant — re-
introducing the exact alternation violation #2313 (37a997945)
deliberately removed: a cron brief landing as assistant after the
agent's last turn yields assistant->assistant, which breaks strict-
alternation providers (OpenAI/OpenRouter) per issue #2221. The mirror/
mirror_source metadata is also dropped at the SQLite boundary, so the
[Delivered from cron] label is lost on replay.

This is an intentional, opt-in (default OFF) reversal of #2313's
'cron output does not belong in interactive history' for the reply-to-
cron use case — gated behind cron.mirror_delivery / attach_to_session.

Fixes:
- mirror_to_session gains a role param (default 'assistant' — interactive
  send_message mirror unchanged, it IS the agent speaking). Cron paths
  pass role='user' with a '[Cron delivery: <task>]' prefix so the brief
  collapses via repair_message_sequence's consecutive-user merge on every
  provider, and stays distinguishable on replay despite the metadata drop.
- thread_seeded: defer seeding + the flag until delivery into the new
  thread actually succeeds. Previously set pre-delivery, so an open-
  succeeds / deliver-fails case both stranded a seeded-but-unseen brief
  AND suppressed the DM-fallback mirror.
- seed mirror now passes user_id='system:cron' to resolve the exact
  thread-keyed session row it just created.
- dedupe the duplicate BasePlatformAdapter import in _deliver_result.
- trim oversized docstrings to non-obvious WHY (AGENTS.md).
- docs: document cron.mirror_delivery / attach_to_session in
  website/docs/user-guide/features/cron.md.
- test: assert the cron mirror writes role='user' with the label prefix.

204 cron+mirror tests pass.
2026-06-24 20:27:05 -07:00
Ben
0c3f197cff fix(relay): re-attach DM author user_id on outbound for connector egress
A DM reply carries no guild_id, so the connector's egress guard cannot
resolve the owning tenant from metadata.guild_id and declines the send
with "discord egress declined: target not routed to an onboarded tenant"
— the bug behind "the bot never replies in DMs". Guild replies are
unaffected (they carry guild_id), which is why the guild path worked
end-to-end while DMs looked broken.

The connector now resolves a DM reply's tenant from the recipient's
author binding (gateway-gateway #67, resolveByUser keyed on
metadata.user_id) — the outbound counterpart to inbound Phase 7a
author-first resolution. But it needs the recipient user_id ON the
outbound action, and the adapter only re-attached guild_id
(_capture_scope/_with_scope), no-op for DMs (the docstring even said so).

This extends the adapter's inbound-scope capture: for a DM (no guild_id)
remember chat_id -> the authentic author user_id we observed, and
re-attach it as metadata.user_id on outbound. Guild capture is unchanged
and wins when present; user_id is the DM-only fallback. The id is the one
the connector observed inbound (never gateway-asserted), so the trust
invariant holds.

+4 unit tests (DM reply re-attaches user_id + no guild_id; unknown chat
invents nothing; explicit user_id preserved; guild reply never carries
user_id). Proved load-bearing (reverting the re-attach fails the DM
test). 144 relay tests pass, ruff clean.

Pairs with gateway-gateway #67 (the connector-side resolver). Together
they close the DM-reply egress gap end-to-end.
2026-06-25 12:43:54 +10:00
Ben
d1cac0e5ef feat(gateway): scale-to-zero idle detection + dormant-quiesce (Phase 0)
The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero
primitives (gateway-gateway Phase 5): the gateway decides it is idle and
drives the relay transport dormant so the platform (Fly autostop:"suspend")
can suspend the now-traffic-idle machine, which wakes on the connector's
wakeUrl poke (decisions.md Q3=C', D1-D13).

- gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS
  Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds
  (config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2),
  messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)),
  is_idle (D2/D3/F7).
- gateway/run.py: _last_inbound_at clock stamped on user inbound in
  _handle_message (F13); the arm-gate + idle predicate + the
  _scale_to_zero_watcher dormant sequence (mark draining -> adapter
  go_dormant() -> cooldown), started only when armed. Deliberately NOT the
  stop path and NOT mark_resume_pending (F12/D13).
- tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7).
- hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5.

Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED).
Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in
the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests),
confirmed via a pristine baseline worktree.
2026-06-24 18:47:18 -07:00
Ben
96af4bec30 feat(relay): add go_dormant() transport mode for scale-to-zero (0.E0)
Net-new WebSocketRelayTransport.go_dormant() + RelayAdapter.go_dormant() —
the third transport mode the scale-to-zero behaviour layer needs, distinct
from both disconnect() and an unexpected close (decisions.md D12/F14):

- disconnect() sets _closing=True and CANCELS the reconnect supervisor
  (terminal "shutting down for good") -> a suspended machine never re-dials
  on wake, stranding its buffered backlog.
- an unexpected close re-dials IMMEDIATELY -> the socket never stays down,
  so the platform proxy never suspends the machine.

go_dormant(): going_idle->ack (reuse go_idle), then close the socket WITHOUT
setting _closing, so the reader's fall-through still arms the reconnect
supervisor (wake path stays live) but on the longer _dormant_redial_s
cadence so it doesn't fight the platform suspend window. A successful re-dial
clears _dormant. Honors the §3.4 wake->reconnect->drain contract.

Tests: 6 new in test_relay_going_idle.py incl. the F14 regression guard
(routing dormancy through disconnect() fails exactly the 4 wake-path tests).
Full relay suite 140 passed.
2026-06-24 18:47:18 -07:00
Ben
538c419d2e fix(gateway): scope dashboard liveness fallback to the profile
PR #52151 hardened the runtime-status liveness check to trust a readable
live process command line over stale gateway_state.json argv, so a recycled
PID now owned by an s6 supervisor no longer counts as a running gateway.

That fix is correct but incomplete for the reported symptom: the web
dashboard showed a named profile's gateway green while
`hermes -p <name> gateway status` showed it stopped. Two further issues:

1. Cross-profile PID reuse. In per-profile Docker supervision, one profile's
   stale `gateway_state.json` can record a PID the OS later recycled onto a
   DIFFERENT profile's live gateway. That PID's command line still
   `looks_like_gateway`, so the dead profile was reported running. The
   recorded argv has its `-p <name>` selector stripped in-process by
   `_apply_profile_override`, so it cannot disambiguate; the live `/proc`
   cmdline still carries it. `get_runtime_status_running_pid` now accepts an
   `expected_home` and validates the live command line belongs to THAT
   profile (mirroring `hermes_cli.gateway._matches_current_profile`, the
   logic the CLI scan path already uses — which is why the CLI was correct).
   `_check_gateway_running` passes the enumerated profile dir.

2. The existing regression test `test_gateway_running_check_falls_back_to_
   runtime_state` used the live pytest PID with a gateway-shaped record; once
   the live cmdline became authoritative it no longer looked like a gateway.
   Updated to mock the live cmdline to the real separate-process scenario it
   describes.

The active-profile path (`get_running_pid`) is intentionally left unscoped:
it is lock-verified and any live gateway cmdline is acceptable there. Multiplex
mode is unaffected — `running` state is only ever written to a gateway's own
home, never a secondary served profile's.

Adds coverage for: cross-profile PID reuse (named + default), matching
profile cmdline (`-p`, `--profile`, explicit HERMES_HOME=), the bare default
gateway, and the unreadable-cmdline cross-platform fallback. Each new
cross-profile assertion fails without the profile scope and passes with it.

Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>
2026-06-25 10:25:54 +10:00
helix4u
f1617a7ebb fix(gateway): validate runtime status pid command line 2026-06-25 10:25:54 +10:00
Ben
d335164833 fix(relay): authorize relay inbound via connector-enforced upstream authz
A hosted instance fronted by the Team Gateway connector dropped EVERY relay
message as "Unauthorized user" and the agent never replied — despite the
message routing correctly through the connector to the instance.

Root cause: gateway authorization (_is_user_authorized) had no notion of
upstream-enforced authz. Platform.RELAY matches no {PLATFORM}_ALLOWED_USERS
allowlist and isn't in the HA/WEBHOOK always-authorized set, so a relay user
with no env allowlist configured hit the default-deny ("No user allowlists
configured. All unauthorized users will be denied."). The message was received,
then silently denied before reaching the agent.

This is incorrect for relay: the connector authenticates the gateway's WS with
a per-instance secret and performs owner-only author-binding resolution BEFORE
delivering. A message only reaches this gateway because the connector resolved
it to THIS instance's bound user (user_instance_binding), keyed on the author id
the connector OBSERVED off the event — never a gateway claim. The authorization
decision is already made by a trusted, authenticated upstream; there is no local
RELAY_ALLOWED_USERS allowlist to consult, and default-denying for its absence is
the bug.

Fix: add a generic BasePlatformAdapter.authorization_is_upstream capability
(default False) that the relay adapter overrides to True, plus a dedicated
trusted branch in _is_user_authorized that honors it. This is delegation to a
trusted upstream, NOT a fail-open: it fires only for an adapter that explicitly
declares the flag; every direct network-exposed adapter leaves it False and the
env-allowlist default-deny (SECURITY.md §2.6) is unchanged. Distinct from
enforces_own_access_policy, which mirrors a LOCAL config-driven allowlist —
this delegates to an authenticated upstream's decision.

Tests: behavior contract that the base defaults False, the relay adapter
declares True, a relay user (group + DM) is authorized with no env allowlist,
and crucially a non-upstream adapter with no allowlist still default-denies
(guards against the fix becoming a blanket fail-open). 6 new tests; relay +
authz + config-policy suites green (134 + 90).

Found via live staging debug of the Discord self-serve onboarding flow.
2026-06-25 10:06:21 +10:00
liuhao1024
404b06ac4f fix(gateway): honor server retry_after in _send_with_retry for Telegram flood control (#46762)
When Telegram's sendRichMessage returns a FloodWait/RetryAfter error,
_try_send_rich() now extracts the server-provided retry_after value and
propagates it through SendResult.retry_after. The base _send_with_retry()
layer honors this value instead of using its default short exponential
backoff (~2s, ~4s), preventing the retry budget from being exhausted
against a server that demands a 25-37s wait.

Salvaged from #46774 by @liuhao1024. Telegram adapter path moved from
gateway/platforms/telegram.py to plugins/platforms/telegram/adapter.py
since the original PR.

Closes #46762
2026-06-25 02:43:47 +05:30
kshitijk4poor
9c994377ed fix(gateway): skip non-dict entries in session loading (#46994)
Corrupted sessions.json entries (e.g. a bare bool where a dict is
expected) caused TypeError on 'origin' in data' which escaped the
(ValueError, KeyError) inner except and aborted loading ALL remaining
sessions, not just the corrupted one.

Two-layer fix:
- Loop level: isinstance(entry_data, dict) guard before from_dict
- from_dict: isinstance(data['origin'], dict) instead of bare truthiness
- Added TypeError to the inner except as defense-in-depth

Closes #46994
2026-06-25 01:26:13 +05:30
kshitijk4poor
e0272cfef2 Revert "fix(compression): make minimum context floor configurable (#31600)"
This reverts commit cae1ee44a7.
2026-06-25 01:04:44 +05:30
Tranquil-Flow
cae1ee44a7 fix(compression): make minimum context floor configurable (#31600)
Add compression.minimum_context_floor config key that allows users
to lower the compression threshold floor below the hardcoded 64K
default, preventing infinite tool-call loops on models whose
structured output degrades well before 64K tokens.

- agent/model_metadata.py: add get_configurable_minimum_context()
  helper with 16K hard safety limit
- agent/context_compressor.py: accept minimum_context_floor param,
  thread it through _compute_threshold_tokens
- agent/conversation_compression.py: use compressor's floor for
  aux model context validation
- agent/agent_init.py: read compression.minimum_context_floor from
  config and pass to ContextCompressor
- gateway/run.py: cache-busting includes new key

Salvaged from #31686 by @Tranquil-Flow onto current main.
Resolves conflicts with in-place compaction (#38763) and max_tokens
threshold computation (#43547) that landed after the original PR.

Closes #31600
2026-06-25 00:56:04 +05:30
kshitij
9214aa7dde
Merge pull request #52090 from NousResearch/salvage/35994-reset-deadlock
fix(gateway): offload agent cleanup off the event loop in /new reset (#35994)
2026-06-25 00:34:21 +05:30
kshitijk4poor
0225480369 fix(gateway): offload agent cleanup off the event loop in /new reset (#35994)
The /new (and /reset) confirmation-button callback runs the slash-confirm
handler on the asyncio event loop (see _request_slash_confirm). That handler
calls _handle_reset_command, which invoked the SYNCHRONOUS, potentially
long-blocking _cleanup_agent_resources inline: agent.close() tears down
terminal sandboxes, browser daemons and background processes (subprocess
waits), and shutdown_memory_provider() can make a network call. A slow
teardown wedged the entire event loop, so the bot went silent and stopped
processing all messages until a manual restart.

Offload _cleanup_agent_resources via the existing contextvar-preserving
_run_in_executor_with_context helper, bounded by asyncio.wait_for with a
named _RESET_CLEANUP_TIMEOUT_S (30s). The loop is never blocked; on timeout
the reset proceeds and the worker thread is left to finish on its own (it
cannot be cancelled). The text /new path is unaffected (already off-loop).

Tests (tests/gateway/test_35994_reset_button_deadlock.py): the loop keeps
ticking while close() blocks in its worker thread; a cleanup that raises is
swallowed (warning logged) and the reset still rotates the session; a
cleanup that times out degrades gracefully. All three are mutation-verified
to fail without their respective production branch.
2026-06-25 00:27:22 +05:30
sweetcornna
b41d9b845d fix(gateway): surface retry hint instead of silently dropping turn after /stop (#31884)
After /stop, the next user message can hit a stale generation token and
return with api_calls=0, no failure, no interruption. _normalize_empty_agent_response
fell through to an empty string, so the gateway logged "response=0 chars"
and sent nothing — the message was silently lost while internal work
sometimes continued.

Add the api_calls==0 / not-failed / not-interrupted / not-partial branch
to the single normalization chokepoint so the user gets a short retry hint
instead of silence. Regression test asserts the hint surfaces.

Salvaged from #33851 (re-applied on current main; original was 1401 commits
behind and the function had moved).
2026-06-24 23:51:31 +05:30
kshitij
ae20c3fb90
Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override
fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)
2026-06-24 19:20:11 +05:30
x7peeps
6879d77d74 fix(gateway): consume was_auto_reset so /model survives session auto-reset
When `/model X` is the FIRST message after an idle/daily/suspended auto-reset,
the slash-command path stores a session model override but leaves
`session_entry.was_auto_reset = True` (it never passes through
`_handle_message_with_agent`, which is where the flag was consumed). On the
NEXT regular message, the auto-reset cleanup block pops the freshly-stored
model/reasoning override BEFORE the flag is consumed — so the switch is
silently lost and resolution falls back to the config default, while the
session DB still shows the switched model (a two-sources-of-truth divergence).

Consume the flag at both sites:
  1. gateway/run.py — capture `was_auto_reset` into a local and set the
     attribute False immediately at the top of the cleanup block, so the
     cleanup can't re-fire on a later message and wipe an override stored
     between turns. Downstream reads use the captured local.
  2. gateway/slash_commands.py — the model path consumes the flag before
     storing the override, so a /model-first-after-auto-reset isn't wiped by
     the next message's cleanup.

Salvaged from #48062 by x7peeps (authorship preserved).

Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST
invariants pinning both consume sites (load-bearing; verified they fail when
either consume is removed). Mirrors the AST-pin approach in
test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed.

Fixes #48031
2026-06-24 19:12:44 +05:30
kshitij
d68a133458
Merge pull request #51890 from NousResearch/salvage/40695-handoff-watcher-async
fix(gateway): offload handoff-watcher SQLite calls to avoid blocking the async heartbeat (#40695)
2026-06-24 19:10:52 +05:30
liuhao1024
721cf54fb1 fix(gateway): offload /model provider-listing off the event loop (#41289)
The Discord/Telegram /model slash command listed providers synchronously
on the gateway's async event loop. list_picker_providers /
list_authenticated_providers are blocking and can fall through to a
synchronous urllib HTTP fetch when the on-disk provider cache is stale,
freezing the loop for 120-150s -> "application did not respond" and
delayed agent starts.

Port #41304's asyncio.to_thread offload to the current handler location.
The handler moved from gateway/run.py to gateway/slash_commands.py
(_handle_model_command); wrap BOTH blocking call sites so the whole bug
class is covered:

  - picker path        -> list_picker_providers
  - text-fallback path -> list_authenticated_providers

asyncio.to_thread is already idiomatic in this module (and asyncio is
imported), so the loop now stays responsive while the (possibly
network-bound) listing runs on a worker thread.

Adds tests/gateway/test_model_command_async_offload.py asserting the
offload contract at the real handler seam for both paths (mutation-
survivable: reverting either to_thread wrap fails the matching test).

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:52 +05:30
r266-tech
f0c5d812b0 fix(gateway): offload handoff watcher SessionDB polling off the event loop
The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked
for more than N seconds') because _handoff_watcher polled the synchronous,
blocking SQLite-backed SessionDB directly on the asyncio event loop every
2s. Each list_pending/claim/complete/fail call performed blocking disk I/O
on the loop thread, starving the Discord heartbeat coroutine.

Wrap every blocking SessionDB call inside the watcher loop in
asyncio.to_thread(...) so the SQLite work runs on a worker thread and the
event loop (and heartbeat) stays responsive. These four call sites are the
only synchronous self._session_db.* calls inside the watcher loop body.

Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher
offloads its SessionDB calls via asyncio.to_thread (mutation-survivable:
reverting any to_thread wrap fails the corresponding assertion).

Fixes #40695

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 18:40:23 +05:30
Ben
c93b9f9057 feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state
Some checks are pending
CI / detect (push) Waiting to run
CI / tests (push) Blocked by required conditions
CI / lint (push) Blocked by required conditions
CI / typecheck (push) Blocked by required conditions
CI / docs-site (push) Blocked by required conditions
CI / history-check (push) Blocked by required conditions
CI / contributor-check (push) Blocked by required conditions
CI / uv-lockfile (push) Blocked by required conditions
CI / docker-lint (push) Blocked by required conditions
CI / supply-chain (push) Blocked by required conditions
CI / osv-scanner (push) Blocked by required conditions
CI / All required checks pass (push) Blocked by required conditions
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway
relay (Unit 7b deprovision), the connector revokes the per-gateway secret and
closes the gateway's WS with 4401. The reconnect supervisor previously treated
EVERY close as retryable, so the live process spun "retrying 4401" forever and
the dashboard showed a red error — opt-out looked like a failure.

Now a 4401 close that arrives AFTER a successful handshake is recognized as a
terminal credential revocation:

- ws_transport.py: track `_handshake_succeeded` (set when a descriptor is
  received); on a 4401 close after a prior success, latch `auth_revoked` and do
  NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake
  stays retryable (cold-start / not-yet-provisioned race, not a revocation).
  New `auth_revoked` property + a websockets-version-safe close-code reader
  (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+).
- adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean,
  NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error
  handler (so the adapter is removed and NOT queued for reconnection — the
  credential is dead until the instance is recreated). Monitor is cancelled on
  disconnect; only started when the transport exposes `auth_revoked` (prod WS).
- run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a
  `disabled` platform_state (not `fatal`/`retrying`).
- web: PlatformsCard renders the `disabled` state with a neutral outline badge,
  a PowerOff icon, and muted (not destructive-red) text + message. New optional
  `status.disabled` i18n string ("Disabled").

Also bundles the Phase 7 contract-doc update (this doc is authoritative in
hermes-agent): docs/relay-connector-contract.md gains an "Author-first
resolution + the account-link (DM) path" section documenting the
multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by
guild; unlinked → fail-closed), the `/link <code>` DM flow, and the
connector-authoritative opt-out + terminal-4401 behavior this PR implements.

Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect;
4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable
relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests
pass (incl. the contract-doc conformance test); ruff clean; web tsc clean.

Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live
de-register, no recreate) + the restart-re-provision hole deferred post-alpha.
2026-06-24 18:43:01 +10:00
Chaz Dinkle
abc3662bf6 fix(gateway): detect launchd in /restart service-manager probe (#43475)
On a launchd-managed gateway (macOS), /restart stopped the gateway but
never relaunched it: the handler's service detection checks only
INVOCATION_ID (systemd) and container markers, so under launchd it takes
the detached path and exits 0 — which KeepAlive.SuccessfulExit=false
treats as a deliberate stop. The gateway stays silently dead until a
manual launchctl kickstart.

Detect launchd via XPC_SERVICE_NAME, which launchd sets to the job label
for processes it spawns. The probe deliberately excludes the literal
"0": interactive macOS shells inherit XPC_SERVICE_NAME=0 (a truthy
string), and routing an unsupervised interactive gateway to the service
path would make it exit non-zero with nothing to revive it.

Routing through via_service=True (rather than forcing a non-zero exit
on the detached path) matters: the detached path also spawns a helper
that relaunches the gateway, so exiting non-zero there would have BOTH
the helper and launchd respawn it — two gateways racing for the same
bot tokens. The service path spawns no helper; launchd is the single
respawner.

Fixes #43475. Supersedes the run.py-era probes in #19940/#33393 (the
handler has since moved to gateway/slash_commands.py) and avoids the
double-spawn risk in the exit-code-site approaches (#43498, #43596).
2026-06-24 00:14:25 -07:00
Teknium
0ef86febe2
docs(sessions): clarify sessions.json is the gateway routing index, not the session list (#51726)
Users who inspect ~/.hermes/sessions/sessions.json see only gateway entries
(e.g. agent:main:whatsapp:dm:...) and mistake it for the session index that
hermes sessions list / /sessions read — which is actually state.db. Issue
#49361 reported CLI sessions as 'invisible' on this premise.

- gateway/session.py: write a self-documenting _README sentinel at the top of
  sessions.json explaining it's the gateway routing index and that ALL sessions
  (CLI/TUI/gateway) live in state.db; skip _-prefixed keys on load so the
  sentinel never round-trips into a SessionEntry.
- Harden every sessions.json reader against the sentinel: mcp_serve loader,
  gateway/mirror.py, gateway/channel_directory.py all skip _-prefixed keys.
- docs/user-guide/sessions.md: warning callout naming the exact symptom.
- tests: assert prune ignores metadata sentinels; add round-trip coverage.
2026-06-23 23:56:36 -07:00
uperLu
0d4cecb352 fix(cron): avoid provider package shadowing core cron 2026-06-23 23:39:22 -07:00
Ben
31bced1607 fix(profiles): detect a separate-process gateway in profile status
The dashboard Profiles view showed "Gateway stopped" for a gateway that
is in fact running — while the sidebar status strip and `hermes gateway
status` (CLI) both correctly showed it running. Reported on v0.17.0
running the gateway + dashboard in one Docker container.

Root cause: three liveness surfaces with three detection strengths, all
reading the same `gateway.pid`:

  - `hermes gateway status` -> find_gateway_pids() (process-table scan)
  - sidebar /api/status     -> get_running_pid() + gateway_state.json PID
                               fallback + health-URL probe
  - Profiles view           -> _check_gateway_running() = get_running_pid()
                               ONLY, no fallback

`get_running_pid()` short-circuits to None the moment the runtime lock
(`gateway.lock`) doesn't register as held by the *calling* process —
which is always true when the reader is a separate process from the
gateway (the dashboard is its own s6 service in the container), and also
for any launch-service-managed gateway that left a fresh
`gateway_state.json` but no live PID file. So the Profiles view alone
reported the live gateway as stopped.

Fix: give _check_gateway_running the same fallback the sidebar already
has — after the pid-file/lock check misses, validate the PID recorded in
that profile's gateway_state.json against the live process table via the
existing get_runtime_status_running_pid(). read_runtime_status() gains an
optional path arg so a profile's state file can be read without mutating
the process-global HERMES_HOME (preserving the contextvar-based profile
isolation the dashboard relies on). Backward compatible: every existing
caller passes no argument.

Tests: a regression test that fails pre-fix (live gateway, lock check
returns None -> must still report running) and a guard test that a
'stopped' state file is never reported running even with a live PID.
2026-06-24 16:36:17 +10:00
teknium1
366c2a3766 fix(gateway): propagate fatal-config exit code through start_gateway clean-exit path
The contributor PR stamped runner._exit_code=78 on non-retryable startup
errors, but start_gateway()'s clean-exit branch returned True before the
SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish
script's [ "$1" = "78" ] check never matched and s6 crash-looped the
gateway anyway — the fix was dead as shipped (#51228).

Honor runner.exit_code in the clean-exit branch: raise SystemExit(code)
when set, else return True (normal /restart clean exit). Add a
start_gateway()-level test that asserts process-level SystemExit(78)
propagation — the gap the PR's object-level test missed — plus exit_code
on the existing _CleanExitRunner mocks.
2026-06-24 16:34:51 +10:00
Francesco Mucio
776f68e1ee fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop
Profiles without their own messaging token inherit the default
profile's token via os.getenv, hit a token collision, and exit with
startup_failed.  s6 restarts them immediately, creating ~30MB tirith
sandbox dirs in /tmp each cycle — filling the disk in hours (#51228).

Changes:
- gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78
- gateway/run.py: set exit_code=78 on non-retryable startup errors
  (token collision, no platforms)
- hermes_cli/service_manager.py: add _render_finish_script() that
  translates exit 78 → exit 125 (s6 permanent failure)
- hermes_cli/container_boot.py: write finish script alongside run
  script during profile registration

The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish.

Closes #51228
2026-06-24 16:34:51 +10:00
jeremy gu
044996e403 fix(gateway): track no-systemd restart runtimes 2026-06-23 23:29:28 -07:00
helix4u
06cbc3bae9 fix(photon): recover degraded upstream stream 2026-06-23 21:33:10 -07:00
Ben Barclay
6e88f7b6f7
feat(relay): Phase 5 Unit C — wake primitive (gateway side) (#51595)
Register a per-instance wakeUrl and forward it to the connector at
self-provision so a suspended gateway can be poked awake when buffered
work arrives (pairs with the connector-side WakePoker).

- relay_wake_url() resolver (env GATEWAY_RELAY_WAKE_URL, then
  gateway.relay_wake_url in config.yaml), mirroring relay_instance_id()
- thread wake_url through _post_provision (adds wakeUrl to the body only
  when set) + self_provision_relay (resolve, forward, log)
- hermes gateway enroll --wake-url <url> persists GATEWAY_RELAY_WAKE_URL
- document the §5.2 wake poke in relay-connector-contract.md §3.3
- tests: relay_wake_url resolution (env/config/absent), provision
  forwarding, body-only-when-set (6 new; 130 relay tests pass)

The actual reconnect+drain on wake is Unit B's loop; this unit only
wires the wake SIGNAL. Opt-in: absent wakeUrl => connector never pokes.
2026-06-24 11:00:11 +10:00
Ben Barclay
40fddc9e4c
feat(relay): Phase 5 §5.3 going-idle / buffered-flip primitive (gateway side) (#51572)
The gateway half of the going-idle/buffered-flip primitive (scale-to-zero
PRIMITIVE, not the behaviour). Integrates with the EXISTING drain transition:

- ws_transport: `go_idle()` sends `going_idle` + awaits the connector's
  `going_idle_ack` (connector-authoritative flip-then-ack, Q-5.3c — stays
  serving until the ack so nothing is lost in the flip window); acks a buffered
  inbound (bufferId present) via `inbound_ack` after the handler runs
  (drain-without-dup on the delivery leg); NET-NEW reconnect loop re-dials +
  re-handshakes after an unexpected close (off by default, on in production).
- adapter: emits `going_idle` from its existing `disconnect()` drain seam before
  tearing down the socket; best-effort + guarded (never blocks shutdown).
- transport Protocol + contract doc §3.2 document the 3 new frames.

+6 relay tests (124 pass). NOT in scope: the autonomous idle timer / machine
suspend / NAS health model (deferred behaviour). Ben's relay-adapter solo lane.
2026-06-24 09:50:30 +10:00
fyzanshaik
0ba1dfed78 fix(gateway): refuse model switch on stale checkout to avoid env_float ImportError 2026-06-24 04:16:54 +05:30
islam666
0c79992db5 fix(gateway): preserve _session_tasks on guard mismatch to enable stale lock healing (#48300)
_session_task_is_stale() failed to detect a stale session lock when the owner
task completed and cleaned _session_tasks (del in _process_message_background's
finally) but _active_sessions was NOT released because _release_session_guard
skipped on a guard mismatch (a concurrent reset/new command or drain handoff
swapped _active_sessions[key] to a different guard). With no owner task left to
inspect, _session_task_is_stale reported 'not stale', the orphaned guard was
never healed, and the session deadlocked permanently — later messages received
but never dispatched.

Reorder the finally cleanup to release-then-conditional-delete: release the
guard first, then drop the _session_tasks entry ONLY if the guard was actually
released (session_key no longer in _active_sessions). On a guard mismatch the
done-task entry survives, so the on-entry self-heal (_session_task_is_stale ->
_heal_stale_session_lock) detects the stale lock and clears it on the next
inbound message.

Extracted the cleanup into a callable _cleanup_finished_session_task() helper so
the regression test drives the REAL production code path rather than a copy of
its logic (the original test inlined the fixed logic and passed regardless of
the production order — mutation-verified the rewritten tests now fail on the
buggy del-first order). Added a positive-path test (guard matches -> release +
delete) so both branches are pinned.

Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
2026-06-24 03:06:21 +05:30
Teknium
e32ebc6aa2
feat(skills): /learn — distill a reusable skill from anything you describe (#51506)
Open-ended skill learning across every surface. /learn <free text> takes a
description of any source — a directory, a URL, the workflow you just walked
the agent through, or pasted notes — and the live agent gathers it with the
tools it already has (read_file/search_files, web_extract, the conversation,
the pasted text), then authors a SKILL.md via skill_manage following the
house authoring standards (<=60-char description, the standard section order,
Hermes-tool framing, no invented commands).

No engine, no model-tool footprint, works on any terminal backend (local,
Docker, remote): /learn builds a standards-guided prompt and hands it to the
agent as a normal turn.

- agent/learn_prompt.py: shared standards-guided prompt builder
- /learn registry entry (both surfaces) + CLI handler (inject onto input
  queue) + gateway handler (rewrite turn, fall through, /blueprint pattern)
- tui_gateway command.dispatch returns a send directive -> TUI + dashboard chat
- dashboard Skills page 'Learn a skill' panel (dir + URL + open-ended text)
  composes a /learn request and runs it in chat
- docs (slash-commands ref + skills feature page), 11 targeted tests

Inspired by OpenAI Codex's Record & Replay and the /learn concept from #47234
(dir-distillation engine); reworked to be open-ended and engine-free per
review.
2026-06-23 13:51:28 -07:00
Teknium
6cc07b6cd0
feat(discord): render reasoning as -# subtext via display.reasoning_style (#51168)
Adds a per-platform display.reasoning_style setting (code | blockquote |
subtext) controlling how the show_reasoning summary renders on the gateway.
Discord defaults to "subtext" (-# small grey metadata text); every other
platform keeps the fenced code block. Resolves through the existing
display.platforms.<platform>.reasoning_style override chain.
2026-06-23 10:44:02 -07:00
Ben Barclay
45bc4fb37f
feat(relay): declare relevance policy to the connector + document the management plane (#51248)
The gateway half of Phase 6 Unit ζ: project the agent's existing relevance
knobs into the connector's platform-agnostic vocabulary and declare them at boot
over the /relay/policy route, so the SAME mention-gating / free-response /
allow-bots behavior the agent applies directly also governs relay delivery (and
excluded chatter never wakes a scaled-to-zero agent).

- gateway/relay/__init__.py:
  - relay_relevance_policy(): project require_mention -> requireAddress,
    free_response_channels -> freeResponseScopes, {PLATFORM}_ALLOW_BOTS in
    {mentions,all} -> allowOtherBots. Reads the fronted platform's config block
    + bridged top-level keys. Returns None when all-default (the connector's
    quiet default already matches) or no concrete platform is fronted.
  - send_relay_policy(): POST /relay/policy authenticated with the gateway's own
    per-gateway upgrade token (make_upgrade_token — same bearer as the WS
    upgrade), so the connector attaches it to the authenticated instance, never
    a body-asserted id. Re-declares every boot (self-healing, full replace).
    NEVER raises, NEVER blocks boot — relevance is an optimization layered on
    the δ/ε authorization gate. Reuses the per-gateway secret + the
    /relay/provision host; no new inbound surface, no new credential.
  - _policy_url(): ws(s)://…/relay -> http(s)://…/relay/policy.
- gateway/run.py: call send_relay_policy() after register_relay_adapter()
  succeeds (the secret is resolved by then).
- docs/relay-connector-contract.md: new §7 documenting per-instance delivery +
  the management plane (/manage/* + /relay/policy) + the relevance-declaration
  contract; versioning renumbered to §8. Contract conformance test stays green
  (§2/§3 tables untouched).

Tests: +12 (projection mapping incl. comma-string + top-level fallback; send
auth/skip/fail-soft/non-200). Full relay suite 118 pass. The connector route is
already E2E-proven (connector repo gateway_policy_driver.py); this adds the real
gateway send-path it pairs with.

This completes Phase 6 (Team Gateway per-user isolation) end to end.
2026-06-23 18:43:19 +10:00
kshitijk4poor
0e69cd4b37 fix(memory): honor configured char limits in the no-agent on-disk store
Follow-up to the /memory approve fresh-store fix. Both the CLI fallback and
the messaging-gateway handler built a bare MemoryStore() with the hardcoded
default char limits (2200/1375), ignoring the user's configured
memory.memory_char_limit / user_char_limit. A live agent honors those
overrides (agent/agent_init.py), so an approval applied without a live agent
could accept a write the user's lower cap would reject, or vice versa.

Extract a shared tools.memory_tool.load_on_disk_store() factory that reads
the configured limits (falling back to defaults if config can't load) and
wire both the CLI and gateway handlers to it, closing the gap on both
surfaces and de-duplicating the construction block.
2026-06-23 03:10:53 +05:30
kshitijk4poor
100e7be20e fix(security): deny root-level credential stores in media delivery
The media-delivery denylist in gateway/platforms/base.py enumerated only
.env/auth.json/credentials/config.yaml under HERMES_HOME, so other
credential stores that live at the root fell through and could be
auto-attached to chat replies. The reported case: the Google Workspace
skill's google_token.json refreshes every turn, bumping its mtime to
'now', which kept passing the strict-mode recency window and re-sent the
OAuth token on every reply.

Extend the explicit per-file denylist to mirror the canonical credential
set already enforced by the read/write guards in agent/file_safety.py:
google_token.json, google_oauth_pending.json, auth/google_oauth.json,
.anthropic_oauth.json, webhook_subscriptions.json, cache/bws_cache.json,
auth.lock, and the pairing/ token directory.

Targeted per-file additions (not a blanket ~/.hermes deny, which was
declined in #32090/#34425 because it would block skills/, logs/, and
ad-hoc agent-written deliverables). mcp-tokens/ (#37222) and
state.db/kanban.db (#41071) are left to their sibling targeted PRs.

Reported-by: xxxigm (#50912)
2026-06-23 02:56:48 +05:30
Teknium
2ba1cfeb2e
feat(goals): completion contracts for /goal — evidence-based judging (#50501)
Adds an optional structured completion contract to the standing-goal loop,
adapted from OpenAI Codex's /goal guidance (a durable objective works best
when it names what done means, how to prove it, what not to break, what's in
scope, and when to stop).

A contract has five optional fields — outcome, verification, constraints,
boundaries, stop_when. When set, the continuation prompt tells the agent to
target the verification surface and respect constraints, and the judge marks
the goal done only when the verification criterion is met with concrete
evidence (command result, file excerpt, test output) instead of a loose
"looks done" claim. This tightens the most common /goal failure mode:
premature completion / endless over-continuation on an underspecified goal.

Two ways to set a contract, both backward compatible (bare /goal <text>
behaves exactly as before):
- /goal draft <objective>  — expands plain text into a full contract via the
  goal_judge aux model (cache-safe side call), falls back to a free-form goal
  if the model is unavailable.
- /goal <text> with inline 'field: value' lines (verify:, constraints:,
  boundaries:, stop when:, ...). Plain goals with an incidental colon are not
  mangled — only known field prefixes are pulled out.
- /goal show prints the active contract.

Contracts persist in SessionDB.state_meta alongside the goal (survive /resume),
compose with /subgoal criteria, and old goal rows load unchanged. CLI + every
gateway platform via the shared GoalManager engine; zero new model tools.

Tests: +18 in tests/hermes_cli/test_goals.py (parse/serialize/judge-prompt/
draft/fallback), 73/73 green; 42/42 across the broader goal test surface;
live E2E roundtrip (set -> persist -> reload -> contract-aware prompts) green.
2026-06-22 12:20:09 -07:00
Teknium
ff85af3fc7
feat(goals): /goal wait <pid> — park the loop on a background process (#50503)
* feat(goals): add /goal wait <pid> barrier to park the loop on a background process

The /goal loop re-pokes the agent every turn via the post-turn judge. When a
goal is gated on a long-running background process (CI poller, build, test
matrix, deploy) that produces nothing to judge yet, this spins the agent into
'is it done?' busy-work and burns the turn budget.

/goal wait <pid> [reason] parks the loop: while the PID is alive, the judge is
skipped, no turn is consumed, no continuation fires, and /goal status shows a
parked indicator. The barrier auto-clears the moment the process exits (the
agent's notify_on_complete watcher is the natural wake signal), then the next
turn resumes normal judging. /goal unwait clears it manually; pause/resume/clear
drop it; a dead/stale PID can never wedge the loop.

Wired across CLI, gateway, and the mid-run command guard for parity. Barrier
persists in SessionDB.state_meta (survives /resume); GoalState gains
backward-compatible waiting_on_pid/waiting_reason/waiting_since fields. 12 new
tests; docs updated.

* fix(goals): use gateway.status._pid_exists for liveness, not os.kill(pid,0)

The Windows-footguns CI guard flagged os.kill(pid, 0) in _pid_alive — on
Windows that's not a no-op, it routes to CTRL_C_EVENT and hard-kills the
target's console process group (bpo-14484). Delegate to the canonical
footgun-safe gateway.status._pid_exists (psutil + ctypes/POSIX fallback)
instead, with a direct-psutil last resort.

* feat(goals): judge-driven auto-wait — the loop parks itself, no manual /goal wait

Makes the wait barrier automatic. Every turn the judge is shown the agent's
live background processes (pid, command, uptime, output tail from the
process_registry) alongside the goal + response, and can return a new 'wait'
verdict instead of continue:
  {"verdict":"wait","wait_on_pid":N}      → park until that process exits
  {"verdict":"wait","wait_for_seconds":N} → park until the deadline passes
evaluate_after_turn acts on the directive (sets the barrier, parks the loop)
so the agent isn't re-poked into busy-work while CI/builds/deploys run. Adds a
time-based waiting_until barrier alongside the pid barrier; both auto-clear and
can never wedge the loop. Drivers (CLI, gateway, tui_gateway) feed the live
registry in via gather_background_processes(). Manual /goal wait stays as an
override. Judge verdict contract widened to (verdict, reason, parse_failed,
wait_directive); legacy {"done":bool} shape still accepted.

* test(goals): update kanban _fake_judge to the 4-tuple judge contract

CI test(3) caught it: test_kanban_goal_mode's _fake_judge still returned the
3-tuple (verdict, reason, parse_failed), but the kanban loop now unpacks the
4-tuple (+ wait_directive). Update the fake to return None for the directive
and accept the background_processes kwarg.

* feat(goals): trigger-based wait — park on a process's own signal, not just exit

Addresses two gaps in the judge-driven wait: (1) the judge could only express
'wait until PID exits' or 'wait N seconds', so a long-lived watcher/server that
fires a trigger MID-RUN (and may never exit) couldn't be waited on; (2) the
process's own watch_patterns/notify_on_complete trigger was invisible to the judge.

Adds a session-based barrier (waiting_on_session) that releases on the process's
OWN trigger via process_registry.is_session_waiting(): the session exits, OR (if
started with watch_patterns) its pattern matches — even while the process keeps
running. list_sessions() now surfaces session_id + watch_patterns/watch_hit/
notify_on_complete so the judge sees the trigger and is told to prefer
wait_on_session for trigger processes. Judge verdict gains a {wait_on_session}
directive (preferred over pid). Backward-compatible GoalState field; pid + time
barriers unchanged.

Tests: TestSessionTriggerBarrier (release on mid-run pattern match while alive,
release on exit, unknown-session, full park→trigger→resume, parse, validation,
backcompat load). 105 goal-surface + 85 process_registry tests green.
2026-06-22 06:27:29 -07:00