hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-04 12:33:08 +00:00

Author	SHA1	Message	Date
teknium1	f2ca3e3d84	fix(gateway): hold _run_restart on _restart_task + explicit cancel-loop skip Follow-up on the cherry-picked #13173 fix. Holds the _run_restart task in self._restart_task (a bare asyncio.create_task keeps only a weak reference, so a still-pending task can be GC'd mid-flight) and explicitly skips it in the _stop_impl cancel loop alongside _stop_task. Adds AUTHOR_MAP entry for the contributor and a regression test that fails when the task is cancellable. Refs #12875	2026-06-27 03:57:31 -07:00
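The task-retention point in f2ca3e3d84 can be sketched roughly as follows. This is a minimal illustration, not the gateway's code: `Runner`, `cancel_background`, and `demo` are hypothetical names, and only `_restart_task`, `_stop_task`, `_background_tasks`, and exit code 75 come from the commit messages.

```python
from __future__ import annotations

import asyncio


class Runner:
    """Hypothetical stand-in for the gateway runner."""

    def __init__(self) -> None:
        self._restart_task: asyncio.Task | None = None
        self._stop_task: asyncio.Task | None = None
        self._background_tasks: set[asyncio.Task] = set()
        self.exit_code: int | None = None

    async def _run_restart(self) -> None:
        await asyncio.sleep(0.01)  # stands in for awaiting the stop sequence
        self.exit_code = 75

    def request_restart(self) -> None:
        # Keep a strong reference: the event loop itself only holds a weak one.
        self._restart_task = asyncio.create_task(self._run_restart())

    def cancel_background(self) -> int:
        """Cancel tracked tasks, explicitly skipping the restart/stop tasks."""
        skip = {self._restart_task, self._stop_task}
        cancelled = 0
        for task in list(self._background_tasks):
            if task in skip:
                continue
            task.cancel()
            cancelled += 1
        return cancelled


async def demo() -> tuple[int | None, int]:
    runner = Runner()
    other = asyncio.create_task(asyncio.sleep(10))
    runner._background_tasks.add(other)
    runner.request_restart()
    # Even if the restart task is tracked by mistake, the loop skips it.
    runner._background_tasks.add(runner._restart_task)
    cancelled = runner.cancel_background()
    await runner._restart_task
    return runner.exit_code, cancelled
```

Running `asyncio.run(demo())` cancels only the unrelated task and lets the restart task finish and set the exit code.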
zeapsu	1ce5d6d974	fix(gateway): exclude _run_restart from _background_tasks to prevent zombie on /restart When request_restart() adds _run_restart to _background_tasks, _stop_impl later cancels all entries in that set. Since _run_restart is awaiting _stop_task at that point, the CancelledError propagates into _stop_impl, interrupting cleanup before _shutdown_event.set() and _exit_code = 75 execute. This leaves the gateway as a zombie (alive but disconnected) or exiting with code 0 instead of 75, preventing systemd Restart=on-failure from restarting the service. Fix: don't add _run_restart to _background_tasks — it self-terminates in ~50ms and needs no lifecycle management. Fixes #12875	2026-06-27 03:57:31 -07:00
teknium1	4e0788783b	refactor(gateway): extract MoA one-shot restore helper; restore #28686 comment; real-method tests Follow-up on the salvaged MoA restore fix: - Extract the finally-block restore into _restore_moa_one_shot() so the behavior is unit-testable without re-implementing it, and so the gateway /moa handler and the finally block share one implementation. - Restore the load-bearing #28686 zombie-eviction comment above _release_running_agent_state that the original diff dropped. - Rewrite the tests to call the real _restore_moa_one_shot helper (the originals re-implemented the restore logic inline, so they passed regardless of the production code).	2026-06-27 03:43:28 -07:00
srojk34	2f29e3cfc5	fix(gateway): restore MoA one-shot model override on failed turns The MoA one-shot restore ran inside the try block after _handle_message_with_agent returned. When that call raised an exception (agent init failure, interpreter shutdown, OOM), the restore was skipped and the MoA model override stayed permanently on _session_model_overrides — silently routing all subsequent messages through the MoA reference fan-out with no user-visible indication. Move the restore to the finally block so it fires on every exit path (success, exception, interrupt). The restore data lives on the per-turn event object and would be lost if not consumed here.	2026-06-27 03:43:28 -07:00
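The move from the try block to the finally block described in 2f29e3cfc5 follows a general pattern, sketched below under assumptions. `overrides`, `run_turn`, and `failing_handler` are illustrative names and do not reflect the real `_session_model_overrides` plumbing.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical per-session model overrides, keyed by session id.
overrides: dict[str, str] = {}


def run_turn(session: str, one_shot_model: str, handler: Callable[[], None]) -> None:
    """Apply a one-shot override and restore the prior state on every exit path."""
    previous = overrides.get(session)
    overrides[session] = one_shot_model
    try:
        handler()
    finally:
        # Runs on success, exception, and interrupt alike.
        if previous is None:
            overrides.pop(session, None)
        else:
            overrides[session] = previous


def failing_handler() -> None:
    raise RuntimeError("agent init failure")
```

With the restore inside `try` after `handler()`, the exception would skip it and the override would stay in place for later turns.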
teknium1	50f6855217	feat(moa): make /moa one-shot only; route preset switching through the model picker /moa no longer does a sticky model switch. It now always runs a single prompt through the default MoA preset and restores the prior model afterward; the whole argument is the prompt (no preset-name matching). To switch to a MoA preset for the session, select it from the model picker, where presets already surface under a virtual Mixture of Agents provider on every model-selection surface. Also fixes #53444: the TUI one-shot only set session[model_override], which the already-built cached agent ignored, so MoA silently never ran and the turn used the original model. The TUI now does a real in-place agent.switch_model() via _apply_model_switch() when a live agent exists (with a proper restore after the turn), and falls back to a model_override for lazy/unbuilt sessions. Removes the redundant sticky-switch branch from the CLI, gateway, and TUI /moa handlers; updates the command description, usage string, and docs.	2026-06-27 03:09:09 -07:00
briandevans	57864d07ed	fix(gateway): suppress operational status/error noise on all chat gateways, not just Telegram (#39293 ) The Telegram noise/secret filter added in #28533 gated its work on `_gateway_platform_value(platform) != "telegram"`, so `_sanitize_gateway_final_response` and `_prepare_gateway_status_message` only ran for Telegram. Every other human-facing chat surface (WhatsApp, Discord, Slack, Signal, Matrix, plugin platforms, etc.) received raw provider-error bodies verbatim — including any leaked credentials the secret-redaction pass (`sk-…`, `Bearer …`, `gh[pousr]_…`, `xox[baprs]-…`, `hf_…`, `glpat-…`) was meant to strip. Invert the gate from a one-platform allowlist into a small programmatic-surface denylist: only `local`, `api_server`, `webhook`, and `msgraph_webhook` consume gateway text programmatically and keep raw status/error text. Every other (chat) surface — including unknown/empty platform values and on-demand plugin pseudo-members — fails closed to the redacted, noise-filtered, sanitized path. This widens the same root-cause fix to both call sites: status callbacks and final replies.	2026-06-27 04:47:10 +05:30
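The gate inversion in 57864d07ed (one-platform allowlist to programmatic-surface denylist) might look roughly like this. The four surface names are taken from the commit message; `should_sanitize` and its normalization are assumptions for illustration.

```python
# Surfaces that consume gateway text programmatically and keep raw text
# (names from the commit message).
PROGRAMMATIC_SURFACES = frozenset({"local", "api_server", "webhook", "msgraph_webhook"})


def should_sanitize(platform: object) -> bool:
    """Hypothetical gate: fail closed for anything not known to be programmatic."""
    value = str(platform or "").strip().lower()
    return value not in PROGRAMMATIC_SURFACES
```

Unknown, empty, or `None` platform values fall through to the sanitized path, which is the fail-closed behaviour the commit describes.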
kshitijk4poor	b0f44d3fad	fix(gateway): remove process-global HERMES_SESSION_KEY write that misroutes approval prompts across concurrent sessions GatewayRunner._run_agent's run_sync() wrote the per-turn session key to the process-global os.environ["HERMES_SESSION_KEY"]. Because os.environ is shared across the whole process, concurrent gateway sessions (e.g. two Discord threads) clobbered each other's value. A tool worker thread whose approval contextvar was unset then fell back to os.environ via get_current_session_key() and read whichever session ran run_sync() last — routing "Command Approval Required" prompts to the wrong thread. Session routing is already concurrency-safe via contextvars: - gateway/session_context.py _SESSION_KEY (set in set_session_vars) - tools/approval.py _approval_session_key (set via set_current_session_key right before the agent runs, inherited by tool worker threads) The only non-test readers of HERMES_SESSION_KEY (tools/approval.py, tools/terminal_tool.py, tools/kanban_tools.py) all prefer the contextvar with os.environ as a mere fallback. CLI/cron/TUI set their own os.environ via separate export paths (e.g. the TUI parent exporting it into the agent subprocess), so removing this in-process write does not affect them. Adds regression tests asserting the resolver prefers the contextvar and does not leak a concurrent session's cleared/clobbered os.environ value. Closes #24100 Co-authored-by: Yosapol Jitrak <yosapol@jitrak.dev>	2026-06-27 04:31:37 +05:30
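A minimal sketch of the resolver preference described in b0f44d3fad, assuming a single context variable. The real code spreads this across `gateway/session_context.py` and `tools/approval.py`; `_session_key` and `run_session` here are illustrative.

```python
import contextvars
import os

# Hypothetical stand-in for the per-turn session contextvar.
_session_key: contextvars.ContextVar[str] = contextvars.ContextVar("session_key", default="")


def get_current_session_key() -> str:
    """Prefer the contextvar; fall back to the process-global env var."""
    return _session_key.get() or os.environ.get("HERMES_SESSION_KEY", "")


def run_session(key: str) -> str:
    """Run one 'turn' in its own context so concurrent sessions cannot clobber each other."""
    def body() -> str:
        _session_key.set(key)
        return get_current_session_key()

    return contextvars.copy_context().run(body)
```

Because each session sets the key inside its own context, a value left in `os.environ` by another session is never consulted while the contextvar is set.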
Ben	8ab7246c45	fix(gateway): stamp drain marker with instantiation epoch so a durable-volume restart clears it (NS-570) The external-drain marker .drain_request.json is written under HERMES_HOME, which on Hermes Cloud is a persistent Fly volume (/opt/data). A begin-drain marker therefore SURVIVES the post-update machine restart. But the disruptive lifecycle actions a drain protects (auto-update / image migrate / env edit / profile change) all restart the machine — which is exactly the signal the drain is over. The freshly-restarted gateway re-read the orphaned marker on its startup reconcile and parked itself back in 'draining', refusing every new turn indefinitely (NS-570: ~52 min until manually cleared). Fix: stamp the marker with an identity of THIS container/VM instantiation (kernel boot_id + PID 1 start time, read from /proc) and treat a marker whose epoch differs from the current instantiation as absent. A deliberate restart → new PID 1 → new epoch → stale marker ignored → gateway boots 'running'. A marker written during the current instantiation (the live drain) still matches; an s6 respawn of just the gateway (PID 1/init unchanged) keeps the same epoch, so an in-flight drain is still honoured (D4a reversibility preserved). The staleness check is lenient and never fail-closed: a legacy marker with no epoch, a corrupt/contentless marker, or an environment with no /proc (epoch unavailable) all degrade to the original presence-only behaviour. NAS is untouched — it only ever POSTs begin/cancel-drain over HTTP; the marker file is purely gateway-internal IPC. The fix is entirely within gateway/drain_control.py; the watcher and the dashboard endpoint go through the same drain_requested()/write_drain_request() chokepoints and need no functional change.	2026-06-26 18:59:41 +05:30
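The lenient staleness rule in 8ab7246c45 can be expressed as a small pure function. This is a sketch under assumptions: the `epoch` field name and both function signatures are hypothetical, and the real epoch is derived from `/proc` rather than passed in.

```python
from __future__ import annotations


def marker_is_stale(marker: dict | None, current_epoch: str | None) -> bool:
    """Return True only when both epochs are known and differ."""
    if not isinstance(marker, dict):
        return False  # corrupt or contentless marker: presence-only behaviour
    stamped = marker.get("epoch")  # hypothetical field name
    if not stamped or not current_epoch:
        return False  # legacy marker or no /proc: presence-only behaviour
    return stamped != current_epoch


def drain_requested(marker_present: bool, marker: dict | None, current_epoch: str | None) -> bool:
    """A present marker counts unless it is provably from an earlier instantiation."""
    return marker_present and not marker_is_stale(marker, current_epoch)
```

Only a marker provably stamped by a different instantiation is ignored; every uncertain case keeps the original presence-only behaviour.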
Ben	19b2624404	feat(gateway): external drain trigger + accept-gating (begin/cancel + control channel) Tasks 2.1 + 2.2 + 2.3 of the safe-shutdown plan — the reversible quiesce-without-restart machinery NAS drives during a lifecycle action (D4a). These ship together because the endpoint, the control channel, and the gateway state machine are one coherent slice. 2.2 — control channel (gateway/drain_control.py, new): The dashboard has no HTTP path into a running gateway (guardrails: "there is NO external control channel into a running gateway"); restart/drain is driven only by markers the gateway reacts to. So begin/cancel-drain writes/removes a presence-based marker .drain_request.json (HERMES_HOME-scoped, atomic write, never-raises read; a corrupt marker reads as present-contentless → fail-safe toward quiescing). This is Q-B option A. 2.2 — gateway state machine (gateway/run.py): - _external_drain_active flag, DISTINCT from the shutdown _draining flag: this one does NOT exit the process and is fully reversible. - _enter_external_drain / _exit_external_drain: idempotent transitions that flip gateway_state→draining / →running via _update_runtime_status (preserving the live active_agents count). exit refuses to revert to running during a real shutdown or after the loop stops (shutdown wins). - _drain_control_watcher: 1s background task (modelled on _handoff_watcher) reconciling accept-state with the marker; honours a marker that survived a restart on its first tick. Registered alongside the other watchers in start. - New-turn accept gate in _handle_message, placed BEFORE the session-slot claim: when draining, refuse to START a new turn (so active_agents can only fall → no TOCTOU race), while in-flight turns finish untouched. Internal/ system events (restart-recovery replays, bg-process completions) bypass it. 2.1 — endpoint (hermes_cli/web_server.py): POST /api/gateway/drain {action: drain\|cancel}. 
Authenticated by the Task-2.0a token seam (the drain plugin registered this exact path as a token route); attributes the request to the verified token principal. Begin writes the marker, cancel removes it — the gateway process owns the actual transition. Force-override (D6) is NOT here; it maps onto the existing immediate /api/gateway/restart force path. Tests (mocked — necessary-not-sufficient; the HARD live gate Q-B is next): - tests/gateway/test_external_drain_control.py — marker contract (write/clear/ read/corrupt/atomic), state machine (enter/exit/idempotency/shutdown-wins/ loop-stopped), watcher reconcile-enter-then-exit, new-turn refusal, and in-flight-not-interrupted. 15 tests. - tests/hermes_cli/test_web_server.py — /api/gateway/drain begin/default-begin/ cancel/cancel-idempotent/bad-action-400. 6 tests. - dashboard.drain_auth config section already added in 2.0b commit. All touched suites green: 301 (gateway+auth) + 9 (web_server endpoints) passed. Intentionally deferred: - HARD live-validation gate (Q-B): real isolated `hermes gateway run`, drive a real begin-drain marker, prove the 5-point checklist a–e. - Spec-doc status flip + Phase-2 PR. Build status: external-drain, restart-drain, status, dashboard-auth, drain-plugin, token-auth, and web_server-endpoint suites green.	2026-06-26 00:47:19 -07:00
teknium1	43b8ba4181	fix(telegram): preserve Bot API update queue on watcher reconnect After a prolonged outage the in-process network-error ladder escalates to fatal and GatewayRunner._platform_reconnect_watcher rebuilds a fresh adapter that reconnects through the bootstrap path. That path called start_polling(drop_pending_updates=True), discarding every update Telegram queued during the outage — all messages sent while the bot was down were silently lost. The in-process ladder and 409-conflict handler already passed drop_pending_updates=False; only bootstrap did not distinguish a cold first boot from a reconnect. Thread an is_reconnect signal from the watcher through _connect_adapter_with_timeout into adapter.connect(). The base BasePlatformAdapter.connect() gains a keyword-only is_reconnect=False so every adapter inherits a tolerant signature (no per-platform breakage when the runner forwards the kwarg). Telegram translates is_reconnect into drop_pending_updates=not is_reconnect on both the polling and webhook bootstrap calls. Cold boot still drops the stale queue; a watcher reconnect preserves it. Fixes #46621. Co-authored-by: annguyenNous <annguyen@nousresearch.com> Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com> Co-authored-by: Kewe63 <Kewe63@users.noreply.github.com>	2026-06-25 21:29:57 -07:00
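The `is_reconnect` threading in 43b8ba4181 could be sketched like this. The class names and the `last_drop_pending` attribute are illustrative; only the keyword-only `is_reconnect=False` signature and the `drop_pending_updates=not is_reconnect` translation come from the commit message.

```python
import asyncio


class BaseAdapter:
    """Hypothetical base adapter with a tolerant, keyword-only signature."""

    async def connect(self, *, is_reconnect: bool = False) -> bool:
        return True


class TelegramLikeAdapter(BaseAdapter):
    """Illustrative adapter that records what it would pass to start_polling."""

    def __init__(self) -> None:
        self.last_drop_pending: bool | None = None

    async def connect(self, *, is_reconnect: bool = False) -> bool:
        # Cold boot drops the stale queue; a watcher reconnect preserves it.
        self.last_drop_pending = not is_reconnect
        return True
```

Adapters that ignore the flag still accept the keyword argument, so the runner can forward it unconditionally.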
Teknium	0b7128582f	fix(state): detect and repair FTS write corruption that silently drops gateway history (#52798 ) A readable state.db can still reject every message write through the messages_fts* triggers when the FTS5 index is corrupt: base-table reads and PRAGMA integrity_check pass, but INSERT INTO messages fails with 'database disk image is malformed'. The gateway reloads conversation_history from disk each turn, so a silently-failed write hands the next turn stale/empty history even though the same cached AIAgent still holds the live transcript — causing immediate same-session amnesia. (#50502) - hermes_state.py: _db_opens_cleanly() now drives a rolled-back message write through the FTS triggers, so write-only corruption (which the read-only probe reported healthy) is detected. repair_state_db_schema() gains an in-place FTS5 'rebuild' strategy (tier 0) before the dedup/drop tiers, plus an already_healthy short-circuit. Both 'hermes sessions repair' and 'hermes doctor' route through these, so the fix covers the whole class. - hermes_cli/doctor.py: the state.db check runs the write-health probe even on the success (readable) path and repairs in place with --fix. - gateway/run.py: _select_cached_agent_history() prefers the cached agent's longer live _session_messages over a shorter persisted transcript, so an FTS write failure can't wipe in-session context. - tests: regressions for write-health detection, in-place repair preserving rows + resuming writes, the already_healthy shortcut, and the gateway guard. Combines the approaches from #50504 (@0-CYBERDYNE-SYSTEMS-0, issue author), #52165 (@davidgut1982), and #50576 (@trevorgordon981).	2026-06-25 21:18:41 -07:00
Ben Barclay	dedf5643d8	fix(gateway): scale-to-zero never armed — arm-gate counted disabled placeholder platforms (#52831 ) The scale-to-zero idle watcher never started on a correctly-opted-in, relay-only instance, so the gateway never ran its idle decision, never called go_dormant(), and never sent going_idle to the connector. Fly's autostop still suspended the machine on traffic-idle, but the connector never flipped the instance to buffered-only — so an inbound DM took the live delivery path, found no live session for the suspended machine, and was dropped fail-closed with no wake poke. The machine slept and never woke. Root cause: _scale_to_zero_should_arm() passed list(config.platforms.keys()) to messaging_is_relay_only_or_absent(). config.platforms is pre-seeded with a DISABLED placeholder PlatformConfig for every known platform (telegram, discord, slack, matrix, …), so the key set is always the full ~20-entry catalog regardless of what the instance actually runs. The relay-only check discarded "relay", saw the disabled placeholders as live direct-socket platforms, and returned False — so should_arm() was False and the watcher was never created. Verified live on a staging instance: config.platforms keys = [telegram, discord, slack, mattermost, matrix, relay] with only relay enabled=True; should_arm() = False. Fix: filter config.platforms to ENABLED entries before the relay-only check, mirroring the adapter-connect loop which already gates on `if not platform_config.enabled: continue`. This arms off the same notion of "active platform" the rest of start() already uses — no parallel concept. Also add a one-line not-armed diagnostic: when an instance IS opted in (the HERMES_SCALE_TO_ZERO stamp is set) but the watcher still doesn't arm, log why (relay_only_or_absent, the enabled platforms, wake_url present/missing). A non-opted instance stays silent. The arm path previously logged only on success, so a failed arm was invisible. 
Tests: the existing pure-helper tests passed bare names so they never exercised the call site that feeds the placeholder-laden config. Add behaviour-contract tests against the REAL _scale_to_zero_should_arm with a realistic config.platforms (relay enabled + others disabled). The F25 regression test (relay-only + disabled placeholders must arm) and the no-platform case are RED without this fix, GREEN with it; the genuinely-enabled-direct-platform / not-opted-in / no-wake-url cases stay correctly non-arming so the filter can't over-broaden. Wake mechanism itself verified healthy independently (direct wakeUrl GET resumed a suspended staging instance in 1.15s, clean resume signature).	2026-06-26 14:01:48 +10:00
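The arm-gate fix in dedf5643d8 amounts to filtering to enabled platforms before the relay-only check. The sketch below is a simplified assumption of that logic: `PlatformConfig`, the helper bodies, and the `should_arm` signature are illustrative, not the real implementations.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlatformConfig:
    """Hypothetical minimal platform config; real entries carry more fields."""
    enabled: bool = False


def messaging_is_relay_only_or_absent(names: list[str]) -> bool:
    # Assumed semantics: no direct-socket platform remains once "relay" is discarded.
    return all(name == "relay" for name in names)


def should_arm(platforms: dict[str, PlatformConfig], opted_in: bool, wake_url: str | None) -> bool:
    # Filter to ENABLED entries first; disabled placeholders must not count.
    enabled = [name for name, cfg in platforms.items() if cfg.enabled]
    return opted_in and bool(wake_url) and messaging_is_relay_only_or_absent(enabled)
```

Passing `list(platforms.keys())` instead of the filtered list reproduces the bug: the disabled placeholders make the relay-only check return False.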
Teknium	811df74a10	fix(gateway): defer cross-process cache cleanup off the cache lock (#52197 ) (#52761 ) The #45966 cross-process coherence guard popped the stale cached agent and then called the blocking _cleanup_agent_resources (memory-provider shutdown, tool-resource teardown, async-client teardown) while still holding _agent_cache_lock, on the gateway event-loop thread. While that ran, _sweep_idle_cached_agents (driven by _session_expiry_watcher) blocked acquiring the same lock and the asyncio loop stalled for minutes, tripping repeated Discord 'heartbeat blocked' warnings. Fix mirrors the cap-enforcer / idle-sweep paths: pop the stale entry under the lock, release it, then schedule the SOFT release on a daemon thread. The soft path (_release_evicted_agent_soft) is also more correct here than the hard teardown the regression used — the same session rebuilds a fresh agent immediately after invalidation, so its terminal sandbox / browser / bg processes (keyed on task_id) must be preserved for the rebuilt agent to inherit, not torn down. Verified the cross-process site was the only cleanup-under-lock instance; the other _cleanup_agent_resources call sites run outside the lock.	2026-06-25 18:58:47 -07:00
Gille	e7d2f0b93c	fix(windows): suppress console flashes and harden gateway restarts	2026-06-25 14:42:38 -07:00
Teknium	c6575df927	feat(moa): expose MoA presets as selectable virtual models (#46081 ) * feat(moa): expose MoA presets as selectable virtual models Reconstructed onto current main (PR #46081's base had diverged with no common ancestor, marking the PR dirty so CI never dispatched). MoA is now a virtual provider: each named preset is a selectable model under provider 'moa', and the preset's aggregator is the acting model that answers and calls tools. Reference models fan out in parallel via a bounded ThreadPoolExecutor (the same batch pattern delegate_task uses) — all references dispatched at once, collected when every one finishes, then handed to the aggregator. Output order is preserved, failures and the MoA-recursion guard stay isolated per reference. - Removed the old mixture_of_agents model tool and moa toolset. - Added moa as a virtual provider in the provider/model inventory. - /moa is shortcut behavior over model selection (default preset / named preset / one-shot prompt). - Dashboard + Desktop manage named presets; presets appear in model pickers. - Parallel reference fan-out in agent/moa_loop.py with regression test. * fix(moa): thread moa_config through _run_agent to _run_agent_inner The reconstructed gateway MoA wiring declared moa_config on _run_agent (the profile-scoping wrapper) and used it inside _run_agent_inner, but the wrapper never forwarded it — _run_agent_inner had no such parameter, so the runtime hit NameError: name 'moa_config' is not defined on the compression-failure session sync path. Add moa_config to _run_agent_inner's signature and forward it from both wrapper call sites (multiplex and non-multiplex). Caught by tests/gateway/test_compression_failure_session_sync.py on CI shard test(4). 
* fix(moa): classify moa as a virtual provider in the catalog The moa virtual provider has no PROVIDER_REGISTRY/ProviderProfile entry, so provider_catalog() fell through to the default auth_type="api_key" with no env vars — tripping two catalog invariants: - test_provider_catalog: api_key providers must expose a credential env var - test_provider_parity: every hermes-model provider must be desktop-configurable moa already declares auth_type="virtual" in HERMES_OVERLAYS; consult that overlay as an auth_type fallback so the catalog reports moa as virtual (no real credential, no network endpoint). Exempt virtual providers from the desktop parity union check the same way 'custom' is exempt — derived from the catalog, not a hardcoded slug, so future virtual providers are covered too.	2026-06-25 13:52:06 -07:00
srojk34	510bf40705	fix(gateway): read compaction result flag not config flag in hygiene guard (#50098) Salvage of #50098 by @srojk34, cherry-picked onto current main. The hygiene auto-compress guard and the /compress slash command both read compression_in_place (config flag — is in-place mode enabled?) instead of _last_compaction_in_place (result flag — did in-place compaction actually succeed?). Both agents are built without a session_db, so archive_and_compact always fails silently and _last_compaction_in_place stays False. Reading the config flag makes the guard think in-place succeeded, triggering rewrite_transcript() which replaces the original messages with only the compressed summary — permanent data loss. Co-authored-by: srojk34 <srojk34@users.noreply.github.com>	2026-06-25 12:56:05 -07:00
kshitijk4poor	73c8d5a1e7	fix: use self._session_db directly + add regression test - Replace getattr(self.session_store, '_db', None) with self._session_db (the GatewayRunner's own SessionDB, consistent with existing usage in slash_commands.py L240/L499). - Remove verbose comment referencing a branch name as an issue number. - Update stale comment in run.py that said 'today it has no session_db'. - Add regression test verifying session_db is passed and rotated session is persisted (adapted from #51624 by @LeonSGP43). - Add _session_db=None to _make_runner fixtures in test_compress_command, test_compress_focus, and test_compress_plugin_engine.	2026-06-26 00:50:40 +05:30
Omar B	1a38a8ff7d	fix(gateway): pass session_db to compress temp agents so persistence works Manual /compress and session hygiene auto-compress both create temporary AIAgent instances to run compression. These agents were created without a session_db, so compress_context computed the compressed messages in memory, rotated the session ID, and reported success — but never wrote to the database. The next user message reloaded the original full transcript, making compression appear to do nothing. Fix: pass session_db=self.session_store._db to both temp agents so the session rotation is properly persisted. Also set _end_session_on_close on the /compress temp agent (already done in hygiene path) to prevent cleanup from ending the newly rotated session.	2026-06-26 00:50:40 +05:30
kshitij	5de8a8fbe8	Merge pull request #52375 from NousResearch/salvage/47237-dedupe-user-turns fix(gateway): dedupe user turns on transient failure (#47237)	2026-06-26 00:30:59 +05:30
davidgut1982	6208d6b3be	fix(gateway): dedupe user turns on transient failure (#47237) When the gateway persists a user message after a transient provider failure (429/timeout/auth error), subsequent retries of the same Telegram message could stack duplicate user turns in the transcript, causing the agent to fall behind by 1-2 messages. Add has_platform_message_id() to SessionDB (using the existing idx_messages_platform_msg_id partial index) and a SessionStore wrapper. The gateway's transient-failure path checks this before append_to_transcript -- if the platform_message_id is already persisted, the duplicate write is skipped. Salvaged from #47869 by @davidgut1982. Adapted to current main which has additional append sites and an existing content-based dedupe in the exception handler path. Closes #47237	2026-06-26 00:11:17 +05:30
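The dedupe check in 6208d6b3be can be illustrated with an in-memory SQLite database. Everything except the `has_platform_message_id` and `idx_messages_platform_msg_id` names is an assumption: the table schema, the indexed columns, and the function signatures are invented for this sketch.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create a hypothetical messages table with a partial index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id TEXT NOT NULL, "
        "role TEXT NOT NULL, content TEXT, platform_message_id TEXT)"
    )
    conn.execute(
        "CREATE INDEX idx_messages_platform_msg_id "
        "ON messages(session_id, platform_message_id) "
        "WHERE platform_message_id IS NOT NULL"
    )
    return conn


def has_platform_message_id(conn: sqlite3.Connection, session_id: str, message_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM messages WHERE session_id = ? AND platform_message_id = ? LIMIT 1",
        (session_id, message_id),
    ).fetchone()
    return row is not None


def append_user_turn(conn: sqlite3.Connection, session_id: str, content: str, message_id: str) -> bool:
    """Append the user turn unless this platform message is already persisted."""
    if has_platform_message_id(conn, session_id, message_id):
        return False  # retry of the same platform message: skip the duplicate
    conn.execute(
        "INSERT INTO messages (session_id, role, content, platform_message_id) "
        "VALUES (?, 'user', ?, ?)",
        (session_id, content, message_id),
    )
    return True
```

A retry that carries the same platform message id returns `False` and leaves the transcript unchanged.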
Ben Barclay	d6269da7fd	fix(gateway): harden scale-to-zero dormancy guards (#52359) Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake. Add regression coverage for both review findings.	2026-06-25 20:41:03 +10:00
Ben	d1cac0e5ef	feat(gateway): scale-to-zero idle detection + dormant-quiesce (Phase 0) The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero primitives (gateway-gateway Phase 5): the gateway decides it is idle and drives the relay transport dormant so the platform (Fly autostop:"suspend") can suspend the now-traffic-idle machine, which wakes on the connector's wakeUrl poke (decisions.md Q3=C', D1-D13). - gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds (config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2), messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)), is_idle (D2/D3/F7). - gateway/run.py: _last_inbound_at clock stamped on user inbound in _handle_message (F13); the arm-gate + idle predicate + the _scale_to_zero_watcher dormant sequence (mark draining -> adapter go_dormant() -> cooldown), started only when armed. Deliberately NOT the stop path and NOT mark_resume_pending (F12/D13). - tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7). - hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5. Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED). Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests), confirmed via a pristine baseline worktree.	2026-06-24 18:47:18 -07:00
kshitijk4poor	e0272cfef2	Revert "fix(compression): make minimum context floor configurable (#31600)" This reverts commit `cae1ee44a7`.	2026-06-25 01:04:44 +05:30
Tranquil-Flow	cae1ee44a7	fix(compression): make minimum context floor configurable (#31600 ) Add compression.minimum_context_floor config key that allows users to lower the compression threshold floor below the hardcoded 64K default, preventing infinite tool-call loops on models whose structured output degrades well before 64K tokens. - agent/model_metadata.py: add get_configurable_minimum_context() helper with 16K hard safety limit - agent/context_compressor.py: accept minimum_context_floor param, thread it through _compute_threshold_tokens - agent/conversation_compression.py: use compressor's floor for aux model context validation - agent/agent_init.py: read compression.minimum_context_floor from config and pass to ContextCompressor - gateway/run.py: cache-busting includes new key Salvaged from #31686 by @Tranquil-Flow onto current main. Resolves conflicts with in-place compaction (#38763) and max_tokens threshold computation (#43547) that landed after the original PR. Closes #31600	2026-06-25 00:56:04 +05:30
sweetcornna	b41d9b845d	fix(gateway): surface retry hint instead of silently dropping turn after /stop (#31884) After /stop, the next user message can hit a stale generation token and return with api_calls=0, no failure, no interruption. _normalize_empty_agent_response fell through to an empty string, so the gateway logged "response=0 chars" and sent nothing — the message was silently lost while internal work sometimes continued. Add the api_calls==0 / not-failed / not-interrupted / not-partial branch to the single normalization chokepoint so the user gets a short retry hint instead of silence. Regression test asserts the hint surfaces. Salvaged from #33851 (re-applied on current main; original was 1401 commits behind and the function had moved).	2026-06-24 23:51:31 +05:30
kshitij	ae20c3fb90	Merge pull request #51025 from NousResearch/salvage/cron-autoreset-override fix(gateway): consume was_auto_reset so /model survives session auto-reset (#48031)	2026-06-24 19:20:11 +05:30
x7peeps	6879d77d74	fix(gateway): consume was_auto_reset so /model survives session auto-reset When `/model X` is the FIRST message after an idle/daily/suspended auto-reset, the slash-command path stores a session model override but leaves `session_entry.was_auto_reset = True` (it never passes through `_handle_message_with_agent`, which is where the flag was consumed). On the NEXT regular message, the auto-reset cleanup block pops the freshly-stored model/reasoning override BEFORE the flag is consumed — so the switch is silently lost and resolution falls back to the config default, while the session DB still shows the switched model (a two-sources-of-truth divergence). Consume the flag at both sites: 1. gateway/run.py — capture `was_auto_reset` into a local and set the attribute False immediately at the top of the cleanup block, so the cleanup can't re-fire on a later message and wipe an override stored between turns. Downstream reads use the captured local. 2. gateway/slash_commands.py — the model path consumes the flag before storing the override, so a /model-first-after-auto-reset isn't wiped by the next message's cleanup. Salvaged from #48062 by x7peeps (authorship preserved). Tests: tests/gateway/test_48031_model_switch_after_auto_reset.py — AST invariants pinning both consume sites (load-bearing; verified they fail when either consume is removed). Mirrors the AST-pin approach in test_35809_auto_reset_clean_context.py. Gateway session/reset suite: 16 passed. Fixes #48031	2026-06-24 19:12:44 +05:30
r266-tech	f0c5d812b0	fix(gateway): offload handoff watcher SessionDB polling off the event loop The Discord gateway heartbeat stalled ('Shard ID None heartbeat blocked for more than N seconds') because _handoff_watcher polled the synchronous, blocking SQLite-backed SessionDB directly on the asyncio event loop every 2s. Each list_pending/claim/complete/fail call performed blocking disk I/O on the loop thread, starving the Discord heartbeat coroutine. Wrap every blocking SessionDB call inside the watcher loop in asyncio.to_thread(...) so the SQLite work runs on a worker thread and the event loop (and heartbeat) stays responsive. These four call sites are the only synchronous self._session_db.* calls inside the watcher loop body. Adds tests/gateway/test_handoff_watcher_async_db.py asserting the watcher offloads its SessionDB calls via asyncio.to_thread (mutation-survivable: reverting any to_thread wrap fails the corresponding assertion). Fixes #40695 Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>	2026-06-24 18:40:23 +05:30
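The offloading described in f0c5d812b0 relies on `asyncio.to_thread`, which runs a blocking callable on a worker thread. The sketch below uses a fake database, and `FakeSessionDB` and `poll_once` are illustrative names rather than the watcher's real structure.

```python
import asyncio
import threading
import time


class FakeSessionDB:
    """Hypothetical blocking store standing in for the SQLite-backed SessionDB."""

    def list_pending(self) -> list[tuple[str, int]]:
        time.sleep(0.01)  # simulates blocking disk I/O
        # Report which thread did the work so the offload is observable.
        return [("handoff-1", threading.get_ident())]


async def poll_once(db: FakeSessionDB) -> list[tuple[str, int]]:
    # The blocking call runs on a worker thread; the event loop stays free
    # to service other coroutines (for example a heartbeat).
    return await asyncio.to_thread(db.list_pending)
```

Calling `db.list_pending()` directly inside the coroutine would instead block the loop thread for the duration of the I/O.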
Ben	c93b9f9057	feat(relay): terminal 4401 (opt-out) → clean "Relay disabled" state Phase 7 Unit 7d-B. When an operator opts an instance OUT of the Team Gateway relay (Unit 7b deprovision), the connector revokes the per-gateway secret and closes the gateway's WS with 4401. The reconnect supervisor previously treated EVERY close as retryable, so the live process spun "retrying 4401" forever and the dashboard showed a red error — opt-out looked like a failure. Now a 4401 close that arrives AFTER a successful handshake is recognized as a terminal credential revocation: - ws_transport.py: track `_handshake_succeeded` (set when a descriptor is received); on a 4401 close after a prior success, latch `auth_revoked` and do NOT spawn the reconnect supervisor. A 4401 BEFORE any successful handshake stays retryable (cold-start / not-yet-provisioned race, not a revocation).
New `auth_revoked` property + a websockets-version-safe close-code reader (prefers `.rcvd`/`.sent` Close frames; `.code` is deprecated in websockets 13+). - adapter.py: a revocation monitor turns `transport.auth_revoked` into a clean, NON-retryable `relay_disabled` fatal and notifies the gateway's fatal-error handler (so the adapter is removed and NOT queued for reconnection — the credential is dead until the instance is recreated). Monitor is cancelled on disconnect; only started when the transport exposes `auth_revoked` (prod WS). - run.py: `_handle_adapter_fatal_error` maps the `relay_disabled` code to a `disabled` platform_state (not `fatal`/`retrying`). - web: PlatformsCard renders the `disabled` state with a neutral outline badge, a PowerOff icon, and muted (not destructive-red) text + message. New optional `status.disabled` i18n string ("Disabled"). Also bundles the Phase 7 contract-doc update (this doc is authoritative in hermes-agent): docs/relay-connector-contract.md gains an "Author-first resolution + the account-link (DM) path" section documenting the multi-tenant-guild rule (D-7.2 — route by authenticated author binding, never by guild; unlinked → fail-closed), the `/link <code>` DM flow, and the connector-authoritative opt-out + terminal-4401 behavior this PR implements. Tests: +2 ws_transport (4401-after-handshake terminal / no-reconnect; 4401-before-handshake stays retryable) and +2 adapter (revocation → non-retryable relay_disabled fatal + handler fired; no-revocation → no fatal). 138 relay tests pass (incl. the contract-doc conformance test); ruff clean; web tsc clean. Phase 7 Unit 7d-B (relay-adapter solo lane). Q17 → Option 2; Option 3 (live de-register, no recreate) + the restart-re-provision hole deferred post-alpha.	2026-06-24 18:43:01 +10:00
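The close-code rule in c93b9f9057 reduces to a small decision function. The sketch below is an assumption-laden illustration: `ReconnectPolicy` and its method names are hypothetical, and the "stay terminal once latched" branch is my reading of "latch"; only the 4401 code, the handshake condition, and the `auth_revoked` name come from the commit message.

```python
from dataclasses import dataclass

AUTH_REVOKED_CLOSE_CODE = 4401


@dataclass
class ReconnectPolicy:
    """Hypothetical holder for the two flags the commit describes."""
    handshake_succeeded: bool = False
    auth_revoked: bool = False

    def on_descriptor_received(self):
        # A received descriptor marks the handshake as successful.
        self.handshake_succeeded = True

    def should_reconnect(self, close_code):
        if self.auth_revoked:
            return False  # assumption: once latched, stay terminal
        if close_code == AUTH_REVOKED_CLOSE_CODE and self.handshake_succeeded:
            self.auth_revoked = True  # terminal credential revocation
            return False
        # Any other close, or a 4401 before the first handshake, stays retryable.
        return True


cold = ReconnectPolicy()
cold_retry = cold.should_reconnect(4401)

live = ReconnectPolicy()
live.on_descriptor_received()
drop_retry = live.should_reconnect(1006)
revoked_retry = live.should_reconnect(4401)
```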
teknium1	366c2a3766	fix(gateway): propagate fatal-config exit code through start_gateway clean-exit path The contributor PR stamped runner._exit_code=78 on non-retryable startup errors, but start_gateway()'s clean-exit branch returned True before the SystemExit(runner.exit_code) site, so main() exited 0. The s6 finish script's [ "$1" = "78" ] check never matched and s6 crash-looped the gateway anyway — the fix was dead as shipped (#51228). Honor runner.exit_code in the clean-exit branch: raise SystemExit(code) when set, else return True (normal /restart clean exit). Add a start_gateway()-level test that asserts process-level SystemExit(78) propagation — the gap the PR's object-level test missed — plus exit_code on the existing _CleanExitRunner mocks.	2026-06-24 16:34:51 +10:00
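The clean-exit fix in 366c2a3766 can be illustrated with a few lines. `FakeRunner` and `finish_clean_exit` are hypothetical names, and treating `None` or `0` as "not set" is an assumption; the 78 value and the raise-or-return-True behavior come from the commit message.

```python
FATAL_CONFIG_EXIT_CODE = 78  # EX_CONFIG, per the commit message


class FakeRunner:
    """Hypothetical stand-in for the gateway runner."""

    def __init__(self, exit_code=None):
        self.exit_code = exit_code


def finish_clean_exit(runner):
    # Honor a stamped exit code instead of returning True unconditionally.
    # Assumption: None or 0 means "no code set".
    code = runner.exit_code
    if code:
        raise SystemExit(code)
    return True  # normal /restart clean exit


normal = finish_clean_exit(FakeRunner())

try:
    finish_clean_exit(FakeRunner(FATAL_CONFIG_EXIT_CODE))
    raised = None
except SystemExit as exc:
    raised = exc.code
```

Asserting on `SystemExit` at this level is the process-level check the commit says the earlier object-level test missed.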
Francesco Mucio	776f68e1ee	fix(gateway): exit 78 (EX_CONFIG) on fatal startup errors, s6 finish script stops restart loop Profiles without their own messaging token inherit the default profile's token via os.getenv, hit a token collision, and exit with startup_failed. s6 restarts them immediately, creating ~30MB tirith sandbox dirs in /tmp each cycle — filling the disk in hours (#51228). Changes: - gateway/restart.py: add GATEWAY_FATAL_CONFIG_EXIT_CODE = 78 - gateway/run.py: set exit_code=78 on non-retryable startup errors (token collision, no platforms) - hermes_cli/service_manager.py: add _render_finish_script() that translates exit 78 → exit 125 (s6 permanent failure) - hermes_cli/container_boot.py: write finish script alongside run script during profile registration The s6 finish script pattern follows docker/s6-rc.d/dashboard/finish. Closes #51228	2026-06-24 16:34:51 +10:00
helix4u	06cbc3bae9	fix(photon): recover degraded upstream stream	2026-06-23 21:33:10 -07:00
fyzanshaik	0ba1dfed78	fix(gateway): refuse model switch on stale checkout to avoid env_float ImportError	2026-06-24 04:16:54 +05:30
Teknium	e32ebc6aa2	feat(skills): /learn — distill a reusable skill from anything you describe (#51506 ) Open-ended skill learning across every surface. /learn <free text> takes a description of any source — a directory, a URL, the workflow you just walked the agent through, or pasted notes — and the live agent gathers it with the tools it already has (read_file/search_files, web_extract, the conversation, the pasted text), then authors a SKILL.md via skill_manage following the house authoring standards (<=60-char description, the standard section order, Hermes-tool framing, no invented commands). No engine, no model-tool footprint, works on any terminal backend (local, Docker, remote): /learn builds a standards-guided prompt and hands it to the agent as a normal turn. - agent/learn_prompt.py: shared standards-guided prompt builder - /learn registry entry (both surfaces) + CLI handler (inject onto input queue) + gateway handler (rewrite turn, fall through, /blueprint pattern) - tui_gateway command.dispatch returns a send directive -> TUI + dashboard chat - dashboard Skills page 'Learn a skill' panel (dir + URL + open-ended text) composes a /learn request and runs it in chat - docs (slash-commands ref + skills feature page), 11 targeted tests Inspired by OpenAI Codex's Record & Replay and the /learn concept from #47234 (dir-distillation engine); reworked to be open-ended and engine-free per review.	2026-06-23 13:51:28 -07:00
Teknium	6cc07b6cd0	feat(discord): render reasoning as -# subtext via display.reasoning_style (#51168) Adds a per-platform display.reasoning_style setting (code | blockquote | subtext) controlling how the show_reasoning summary renders on the gateway. Discord defaults to "subtext" (-# small grey metadata text); every other platform keeps the fenced code block. Resolves through the existing display.platforms.<platform>.reasoning_style override chain.	2026-06-23 10:44:02 -07:00
Ben Barclay	45bc4fb37f	feat(relay): declare relevance policy to the connector + document the management plane (#51248 ) The gateway half of Phase 6 Unit ζ: project the agent's existing relevance knobs into the connector's platform-agnostic vocabulary and declare them at boot over the /relay/policy route, so the SAME mention-gating / free-response / allow-bots behavior the agent applies directly also governs relay delivery (and excluded chatter never wakes a scaled-to-zero agent). - gateway/relay/__init__.py: - relay_relevance_policy(): project require_mention -> requireAddress, free_response_channels -> freeResponseScopes, {PLATFORM}_ALLOW_BOTS in {mentions,all} -> allowOtherBots. Reads the fronted platform's config block + bridged top-level keys. Returns None when all-default (the connector's quiet default already matches) or no concrete platform is fronted. - send_relay_policy(): POST /relay/policy authenticated with the gateway's own per-gateway upgrade token (make_upgrade_token — same bearer as the WS upgrade), so the connector attaches it to the authenticated instance, never a body-asserted id. Re-declares every boot (self-healing, full replace). NEVER raises, NEVER blocks boot — relevance is an optimization layered on the δ/ε authorization gate. Reuses the per-gateway secret + the /relay/provision host; no new inbound surface, no new credential. - _policy_url(): ws(s)://…/relay -> http(s)://…/relay/policy. - gateway/run.py: call send_relay_policy() after register_relay_adapter() succeeds (the secret is resolved by then). - docs/relay-connector-contract.md: new §7 documenting per-instance delivery + the management plane (/manage/* + /relay/policy) + the relevance-declaration contract; versioning renumbered to §8. Contract conformance test stays green (§2/§3 tables untouched). Tests: +12 (projection mapping incl. comma-string + top-level fallback; send auth/skip/fail-soft/non-200). Full relay suite 118 pass. 
The connector route is already E2E-proven (connector repo gateway_policy_driver.py); this adds the real gateway send-path it pairs with. This completes Phase 6 (Team Gateway per-user isolation) end to end.	2026-06-23 18:43:19 +10:00
Teknium	ff85af3fc7	feat(goals): /goal wait <pid> — park the loop on a background process (#50503 ) * feat(goals): add /goal wait <pid> barrier to park the loop on a background process The /goal loop re-pokes the agent every turn via the post-turn judge. When a goal is gated on a long-running background process (CI poller, build, test matrix, deploy) that produces nothing to judge yet, this spins the agent into 'is it done?' busy-work and burns the turn budget. /goal wait <pid> [reason] parks the loop: while the PID is alive, the judge is skipped, no turn is consumed, no continuation fires, and /goal status shows a parked indicator. The barrier auto-clears the moment the process exits (the agent's notify_on_complete watcher is the natural wake signal), then the next turn resumes normal judging. /goal unwait clears it manually; pause/resume/clear drop it; a dead/stale PID can never wedge the loop. Wired across CLI, gateway, and the mid-run command guard for parity. Barrier persists in SessionDB.state_meta (survives /resume); GoalState gains backward-compatible waiting_on_pid/waiting_reason/waiting_since fields. 12 new tests; docs updated. * fix(goals): use gateway.status._pid_exists for liveness, not os.kill(pid,0) The Windows-footguns CI guard flagged os.kill(pid, 0) in _pid_alive — on Windows that's not a no-op, it routes to CTRL_C_EVENT and hard-kills the target's console process group (bpo-14484). Delegate to the canonical footgun-safe gateway.status._pid_exists (psutil + ctypes/POSIX fallback) instead, with a direct-psutil last resort. * feat(goals): judge-driven auto-wait — the loop parks itself, no manual /goal wait Makes the wait barrier automatic. 
Every turn the judge is shown the agent's live background processes (pid, command, uptime, output tail from the process_registry) alongside the goal + response, and can return a new 'wait' verdict instead of continue: {"verdict":"wait","wait_on_pid":N} → park until that process exits {"verdict":"wait","wait_for_seconds":N} → park until the deadline passes evaluate_after_turn acts on the directive (sets the barrier, parks the loop) so the agent isn't re-poked into busy-work while CI/builds/deploys run. Adds a time-based waiting_until barrier alongside the pid barrier; both auto-clear and can never wedge the loop. Drivers (CLI, gateway, tui_gateway) feed the live registry in via gather_background_processes(). Manual /goal wait stays as an override. Judge verdict contract widened to (verdict, reason, parse_failed, wait_directive); legacy {"done":bool} shape still accepted. * test(goals): update kanban _fake_judge to the 4-tuple judge contract CI test(3) caught it: test_kanban_goal_mode's _fake_judge still returned the 3-tuple (verdict, reason, parse_failed), but the kanban loop now unpacks the 4-tuple (+ wait_directive). Update the fake to return None for the directive and accept the background_processes kwarg. * feat(goals): trigger-based wait — park on a process's own signal, not just exit Addresses two gaps in the judge-driven wait: (1) the judge could only express 'wait until PID exits' or 'wait N seconds', so a long-lived watcher/server that fires a trigger MID-RUN (and may never exit) couldn't be waited on; (2) the process's own watch_patterns/notify_on_complete trigger was invisible to the judge. Adds a session-based barrier (waiting_on_session) that releases on the process's OWN trigger via process_registry.is_session_waiting(): the session exits, OR (if started with watch_patterns) its pattern matches — even while the process keeps running. 
list_sessions() now surfaces session_id + watch_patterns/watch_hit/ notify_on_complete so the judge sees the trigger and is told to prefer wait_on_session for trigger processes. Judge verdict gains a {wait_on_session} directive (preferred over pid). Backward-compatible GoalState field; pid + time barriers unchanged. Tests: TestSessionTriggerBarrier (release on mid-run pattern match while alive, release on exit, unknown-session, full park→trigger→resume, parse, validation, backcompat load). 105 goal-surface + 85 process_registry tests green.	2026-06-22 06:27:29 -07:00
kshitij	1f28b1a9b9	fix(gateway): redact credentials from approval prompts before sending to clients (#48456 ) (#50767 ) Tirith redacts its own findings, but the approval-request callbacks built the operator prompt from the RAW command string, so a credential-shaped value Tirith flagged was sent verbatim to clients, undoing the redaction one layer up. Two egress transports carried the leak; both are fixed via a shared module-level seam _redact_approval_command() (redact_sensitive_text force=True): 1. chat platforms — _approval_notify_sync (gateway/run.py): redact before both the button path (send_exec_approval) and the plain-text /approve fallback. 2. SSE/API stream — _approval_notify (gateway/platforms/api_server.py): redact event['command'] before it is enqueued to API/desktop clients. (whole-bug-class: sibling call path on a separate transport.) force=True so the prompt — a hard secret-egress boundary — honors redaction even when security.redact_secrets is off. Clean commands pass through unchanged. Tests bind the seam (synthetic credential-format fixtures, force-when-disabled) AND assert BOTH callbacks ASSIGN the redacted result before the send/enqueue sink, via an AST contract that rejects a discarded-result call. All mutation-checked.	2026-06-22 11:39:45 +00:00
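The "assign the redacted result before the sink" contract from 1f28b1a9b9 can be sketched like this. The regex, `redact_command`, and `notify_approval` are hypothetical illustrations and do not reproduce the project's `redact_sensitive_text`; the pattern only covers a few `key=value` shapes chosen for the example.

```python
import re

# Hypothetical pattern: a handful of key=value credential shapes.
_SECRET_PATTERN = re.compile(r"(?i)\b(token|api[_-]?key|password)=(\S+)")


def redact_command(command):
    # Keep the key name, replace the value.
    return _SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", command)


def notify_approval(command, send):
    # Assign the result: a bare redact_command(command) call would
    # discard it and send the raw string.
    command = redact_command(command)
    send(f"Approve command? {command}")


sent = []
notify_approval("curl https://example.invalid/?token=abc123", sent.append)
notify_approval("ls -la", sent.append)
```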
Shannon Sands	4b09903de5	fix Nous auth refresh for idle agents	2026-06-21 22:43:48 -07:00
teknium1	4314d451ca	fix(gateway): accept any inbound file type across all messaging platforms Authorization to message the agent is the gate, not the file extension. Previously the inbound-attachment allowlist (SUPPORTED_DOCUMENT_TYPES) was opt-OUT on Discord (allow_any_attachment defaulted false) and had no bypass at all on Telegram/Slack — so an .html (or any non-allowlisted type) was dropped or hard-rejected before the agent saw it. Now every authorized upload is cached and surfaced to the agent regardless of type: - base.cache_media_bytes(): unknown types cache as octet-stream (or the caller-supplied MIME) instead of returning None — fixes the chokepoint that Teams/Telegram-media route through. - discord/telegram/slack adapters: removed the allowlist reject/skip; any non-media attachment is typed DOCUMENT and cached. Known types keep their precise MIME. - Text inlining now gates on a shared _TEXT_INJECT_EXTENSIONS set (text + code + config + markup) instead of a blind UTF-8 decode, so binary formats (PDF/zip/docx) with ASCII headers are never inlined. - gateway/run.py emits the path-pointing context note for every DOCUMENT, including non text/application MIME types. - discord.allow_any_attachment is now a documented no-op kept for config back-compat. Validation: 357 gateway tests pass; E2E confirms .html/.bin/custom types cache, known types stay precise, PDFs are not inlined.	2026-06-21 22:43:45 -07:00
Ben Barclay	de6b3ae377	fix(terminal): bridge docker_extra_args to TERMINAL_DOCKER_EXTRA_ARGS in CLI + gateway (#50631 ) terminal.docker_extra_args passes flags verbatim to `docker run` (e.g. --gpus=all, --shm-size=16g). It was wired into DEFAULT_CONFIG, TERMINAL_CONFIG_ENV_MAP (so `hermes config set` bridged it), terminal_tool._get_env_config (reads TERMINAL_DOCKER_EXTRA_ARGS), and DockerEnvironment (applies extra_args) -- but it was MISSING from cli.py's env_mappings and gateway/run.py's _terminal_env_map. Consequence: a user who hand-edits config.yaml (rather than running `hermes config set`) has docker_extra_args silently dropped on the CLI and gateway/desktop startup paths, while docker_image / docker_volumes (which ARE in those maps) bridge correctly -- producing the reported 'Hermes partially reads the Docker config' symptom where --gpus=all and --shm-size=16g never reach docker run. This is the same bridge-coverage bug class that shipped before for docker_run_as_host_user (cli + gateway) and docker_mount_cwd_to_workspace (gateway). Fix by adding the key to both maps, plus a dedicated regression pin in test_terminal_config_env_sync.py mirroring the existing test_docker_*_is_bridged_everywhere guards.	2026-06-22 15:41:23 +10:00
teknium1	f45ace9318	feat(security): startup security posture audit (warn-on-load) Surface dangerous host/deployment posture at gateway startup so operators get the 'you're exposed' signal the June 2026 MCP-config persistence campaign victims never had. Warn-only — never blocks startup, never raises. Checks (each independently fail-safe): - Running as root (POSIX uid 0) - SSH daemon with PasswordAuthentication enabled (incl. the 'yes' default) - Running in a container with no persistent volume mount over HERMES_HOME - Network-accessible API server with no API_SERVER_KEY New module hermes_cli/security_audit_startup.py; invoked once per process from start_gateway() right after setup_logging(). Cross-platform (root/SSH checks no-op on Windows). Idea: @Cthulhu.	2026-06-21 19:05:27 -07:00
LeonSGP43	09a96ba0f6	fix(gateway): pause Telegram typing before stream finalize In Telegram streaming, the typing indicator persisted through the slow final rich-text/MarkdownV2 finalize edit, so the '...typing' bubble lingered for seconds after the last streamed token. Add a one-shot on_before_finalize hook to GatewayStreamConsumer, fired once when the stream transitions into its finalization path, and wire it on both Telegram streaming call sites to call pause_typing_for_chat() before the final edit. Cover hook ordering and once-only behavior in tests. Fixes #49712	2026-06-21 13:10:25 -07:00
Teknium	7a131f7f40	fix(api-server): stop silently promising async delivery on stateless HTTP path (#50319 ) * fix(api-server): stop silently promising async delivery on stateless HTTP path terminal(notify_on_complete=True / watch_patterns) and delegate_task(background=True) silently no-op'd on the API server / WebUI path (#10760): the watcher / detached child registered, but every API-server route (OpenAI-spec /v1/chat/completions and /v1/responses, plus the proprietary /v1/runs SSE stream) tears down its channel when the turn ends, and APIServerAdapter.send() is a no-op stub. A completion that fires after the response closed had nowhere to go — from the agent side, indistinguishable from a hang. There is no spec-compliant surface to wake the agent later on a stateless HTTP client, so make the no-op honest instead of silent: - Add a per-adapter capability flag supports_async_delivery (default True; APIServerAdapter = False), propagated into a HERMES_SESSION_ASYNC_DELIVERY contextvar via async_delivery_supported(). Toggle on the adapter, not a hardcoded platform string — a future stateless adapter is correct-by-default. - terminal: when delivery is unsupported, skip watcher registration, force notify_on_complete off, and return a notify_unsupported note telling the agent to process(action='poll'). - delegate_task: when delivery is unsupported, fall back to SYNCHRONOUS execution (work runs and returns in the same response) with a note, instead of handing out a handle that never resolves. CLI (in-process completion_queue) and the real gateway platforms are unchanged. Fixes #10760 * refactor(api-server): route session binding through a single no-delivery chokepoint Add APIServerAdapter._bind_api_server_session() and route both agent-entry paths (_run_agent for /v1/chat/completions + /v1/responses, and the /v1/runs _run_sync path) through it. 
The helper hardwires platform="api_server" and async_delivery=False with no async_delivery parameter to pass, so a future route added to the API server physically cannot reintroduce the silent no-op (#10760) by forgetting to mark the channel as non-delivering. The binding stays request-scoped (cleared per turn), so a session resumed later on a delivering interface (CLI / gateway platform) re-binds fresh and is NOT blocked — the no-delivery decision tracks the interface handling the current turn, never the session.	2026-06-21 12:15:14 -07:00
Teknium	d19aabbf2d	fix(gateway): persist in-flight transcript on restart/shutdown drain timeout (#50312) A turn forcibly interrupted by the drain-timeout escalation never reaches turn_finalizer.finalize_turn (the only place that flushes the turn to state.db). Its in-flight tool rounds live only in the in-memory _session_messages, so the immediate pre-restart turn was silently dropped from load_transcript() on resume. _finalize_shutdown_agents now flushes _session_messages to the SQLite session store before teardown. The flush is idempotent (identity-tracked in _flush_messages_to_session_db), so agents that finished gracefully re-flush nothing. The resume_pending / fresh-tool-tail branches in _handle_message_with_agent already expect a transcript whose tail may be a pending tool result. Fixes #13121.	2026-06-21 11:57:15 -07:00
yeyitech	b17180d950	fix(session): finalize owned SQLite session rows on AIAgent.close() Funnel session finalization through AIAgent.close() — the single terminal path every agent (CLI, gateway, subagent, cron) funnels through — so finished agents stop leaving rows with ended_at IS NULL. The biggest leak source was delegate_task subagent + background-review forks whose close() never ended their row. end_session() is first-reason-wins and no-ops on an already-ended row, so a 'compression'/'cron_complete'/'cli_close' reason set by an earlier terminal path is never clobbered. /resume already calls reopen_session(), so finalizing-on-close does not break resumability. Temporary helper agents that rotate/share the session forward (manual compression, gateway session-hygiene) opt out via _end_session_on_close=False. Also stop the long-running gateway heartbeat once the executor is done or the session slot is rebound to a different agent, preventing a stale 'running: delegate_task' bubble from outliving its run. Closes #12029.	2026-06-21 11:35:09 -07:00
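The first-reason-wins behavior that b17180d950 relies on can be expressed as a guarded UPDATE. The table schema, column names other than `ended_at`, and the `'agent_close'` reason are assumptions made for this sketch; the real `end_session()` may be implemented differently.

```python
import sqlite3
import time


def open_db():
    # Hypothetical minimal schema; only ended_at is named in the commit.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sessions (id TEXT PRIMARY KEY, ended_at REAL, end_reason TEXT)"
    )
    return conn


def end_session(conn, session_id, reason):
    # The "ended_at IS NULL" guard makes the call a no-op on an
    # already-ended row, so the first recorded reason is never clobbered.
    cur = conn.execute(
        "UPDATE sessions SET ended_at = ?, end_reason = ? "
        "WHERE id = ? AND ended_at IS NULL",
        (time.time(), reason, session_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = open_db()
conn.execute("INSERT INTO sessions (id) VALUES ('s1')")
first_call = end_session(conn, "s1", "compression")
second_call = end_session(conn, "s1", "agent_close")
final_reason = conn.execute(
    "SELECT end_reason FROM sessions WHERE id = 's1'"
).fetchone()[0]
```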
Liao Shiwu	6f5f58e34b	fix: keep poll read-only for notify_on_complete watcher	2026-06-21 11:11:23 -07:00
Teknium	03563dabac	fix(gateway): raise session-hygiene hard message limit 400 → 5000 (#50194 ) The gateway pre-compression hygiene valve force-compressed any session crossing 400 messages regardless of token usage. On large-context (1M+) models doing many short, message-dense turns, a healthy session at ~16% token usage could hit 400 messages and get force-compressed — and the compression summary's stale Active Task could then bleed into the next turn. The valve's actual purpose is to break a death spiral: when API calls keep disconnecting on an oversized session, no token-usage data arrives, the token threshold never fires, and the transcript grows unbounded. It's a count-based floor for that pathological case only. 400 was tuned for ~200K-context models and is far too low for modern large-context sessions. Raise the default to 5000 — still well clear of any death spiral, but no longer firing on legitimate long conversations. The value remains fully configurable via compression.hygiene_hard_message_limit.	2026-06-21 08:26:19 -07:00
Ben	51a338a1b6	feat(gateway): track active_agents in runtime status on turn boundaries The gateway only rewrote gateway_state.json on lifecycle transitions (start/connect/drain/stop), never on turn start/end. Live-verified on a hosted agent: a confirmed end-to-end turn ran while gateway_updated_at stayed frozen at boot and active_agents was absent — so any active_agents read from the file between transitions is stale. That makes it unusable as a busy/idle signal for an external consumer (NAS deciding whether it's safe to restart/migrate/auto-update an agent mid-turn). Add _persist_active_agents(), called at every turn boundary: - turn start: both running-agent sentinel-claim sites (normal inbound message path + startup-resume path) - turn end: the central _release_running_agent_state() choke point (covers normal completion, /stop, /reset, sentinel cleanup, stale-eviction — every path that ends a running turn) It passes ONLY active_agents to write_runtime_status, leaving gateway_state (and every other field) _UNSET so the read-merge-write preserves the current lifecycle state. Passing gateway_state=None would clobber it — hence a dedicated helper rather than reusing _update_runtime_status. The write is the same cheap JSON write done on lifecycle transitions today; best-effort (a failed status write never disrupts a turn). Behaviour-contract test: an active_agents-only write preserves both running and draining gateway_state, and the count clamps non-negative.	2026-06-21 17:22:52 +05:30
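The reason 51a338a1b6 passes only `active_agents` comes down to a sentinel default in a read-merge-write. The sketch below uses a hypothetical `write_status_sketch` function and file layout; it is not the project's `write_runtime_status`, whose signature is not shown in the log.

```python
import json
import os
import tempfile

_UNSET = object()  # sentinel: "caller did not pass this field"


def write_status_sketch(path, gateway_state=_UNSET, active_agents=_UNSET):
    # Read the current file, merge only the fields that were passed, write back.
    try:
        with open(path, encoding="utf-8") as fh:
            status = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        status = {}
    if gateway_state is not _UNSET:
        status["gateway_state"] = gateway_state  # None is a real value here
    if active_agents is not _UNSET:
        status["active_agents"] = max(0, int(active_agents))  # clamp non-negative
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(status, fh)
    return status


with tempfile.TemporaryDirectory() as tmp:
    status_path = os.path.join(tmp, "gateway_state.json")
    write_status_sketch(status_path, gateway_state="draining")
    after_turn = write_status_sketch(status_path, active_agents=2)
    clamped = write_status_sketch(status_path, active_agents=-3)
    clobbered = write_status_sketch(status_path, gateway_state=None)
```

The last call shows the hazard the commit mentions: an explicit `None` overwrites the lifecycle state, whereas leaving the argument at `_UNSET` preserves it.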
teknium1	8ac5e90ec2	fix(gateway): dedup image_generate media across the compression boundary After context compression, the agent re-sent an already-delivered generated image on every subsequent turn (#46627). The auto-append fallback rescans full history when the message list shrinks (compression- safe path), deduping against _history_media_paths — but that set was built by scanning ONLY MEDIA: text tags in tool results. image_generate returns its path in a JSON payload field (host_image/image/agent_visible_image), never a MEDIA: tag, so generated-image paths never entered the dedup set and were re-emitted after the boundary. Extract the history-path collection into _collect_history_media_paths(), which now covers BOTH delivery shapes: MEDIA: text tags AND image_generate JSON-payload paths (mirroring what _collect_auto_append_media_tags extracts). The inline block in _handle_message is replaced with a call to the helper. Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>	2026-06-20 23:20:16 -07:00
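The two delivery shapes that 8ac5e90ec2 folds into one dedup set can be sketched as below. The `MEDIA:<path>` tag syntax and the function body are assumptions for illustration; the three JSON keys are the ones listed in the commit message, and this is not the project's `_collect_history_media_paths`.

```python
import json
import re

# Assumed tag syntax: "MEDIA:" followed by a whitespace-free path.
_MEDIA_TAG = re.compile(r"MEDIA:(\S+)")
_IMAGE_KEYS = ("host_image", "image", "agent_visible_image")


def collect_history_media_paths(tool_results):
    paths = set()
    for text in tool_results:
        # Shape 1: MEDIA: tags embedded in plain text.
        paths.update(_MEDIA_TAG.findall(text))
        # Shape 2: a JSON payload carrying the path in a known field.
        try:
            payload = json.loads(text)
        except ValueError:
            continue
        if isinstance(payload, dict):
            for key in _IMAGE_KEYS:
                value = payload.get(key)
                if isinstance(value, str) and value:
                    paths.add(value)
    return paths


history = [
    "saved MEDIA:/tmp/chart.png",
    json.dumps({"host_image": "/tmp/gen.png"}),
    "plain text with no media",
]
seen = collect_history_media_paths(history)
```

Scanning only the first shape would leave `/tmp/gen.png` out of `seen`, which is how a generated image could be re-sent after compression.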

1129 commits