hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
teknium1	aacc15b2c9	fix(clarify): raise default clarify_timeout to 3600s (#32762) The 600s default evicted the gateway clarify entry while users were still away (meeting/AFK); a later button tap then landed on a dead entry and the agent hung on 'running: clarify'. Raise the default to 1h in DEFAULT_CONFIG and the get_clarify_timeout() code-level fallback, documenting the running-agent-guard tradeoff. User overrides still win.	2026-06-28 01:07:53 -07:00
konsisumer	3f543229f2	fix(telegram): notify user when clarify button tap arrives after expiry	2026-06-28 01:07:53 -07:00
Teknium	90d25adc9e	fix(gateway): deliver profile-scoped cache media on symlinked HERMES_HOME (#54060) Generated images under a profile gateway's cache (profiles/<name>/cache/images/...) were silently dropped from Telegram/Discord delivery when HERMES_HOME is symlinked under a denied prefix (e.g. /opt/data -> /root/.hermes) and $HOME is not that prefix. The resolved path lands under /root (a system denylist prefix), the root-home exception only fires when the denied prefix IS $HOME, and the static safe-roots list only covers the active HERMES_HOME's top-level cache — not per-profile cache dirs. Both gates fail, so validate_media_delivery_path returns None and the gateway logs 'Skipping unsafe MEDIA directive path'. _media_delivery_allowed_roots() now also enumerates per-profile cache roots (<root>/profiles/*/cache/{images,audio,videos,documents,screenshots}) at check time. Allowlist match runs before the denylist, so the profile artifact delivers regardless of the /root interaction; profile-dir credentials (auth.json) stay blocked since they aren't under a cache subdir. Reopened regression of #34485/#38108, neither of which covered the profile-scoped symlink case. Fixes #31733.	2026-06-28 01:07:28 -07:00
sweetcornna	2701ea2f0c	fix(agent): reopen fallback chain after primary recovery	2026-06-28 00:57:42 -07:00
teknium1	7b9ff310b6	fix: salvage #33830 for current main — relocate allow_bots bridge to telegram plugin hook, fix stale adapter import in test	2026-06-28 00:57:03 -07:00
sweetcornna	fc70d023d8	fix(telegram): apply bot auth policy to Telegram sources # Conflicts: # gateway/config.py	2026-06-28 00:57:03 -07:00
teknium1	52d774f0f9	fix(state): F_FULLFSYNC barrier at WAL checkpoints on macOS (#30636) On Darwin, synchronous=FULL (the WAL default) only issues a plain fsync(), which Apple documents does NOT guarantee writes reach stable storage or stay ordered. SQLite's WAL corruption-safety guarantee assumes the OS honors the fsync barrier; macOS does not unless the app uses F_FULLFSYNC. During a launchd system shutdown the page cache is dropped (effectively power-loss for in-flight pages), so a WAL checkpoint whose fsync 'reported' durable may never hit the platter — corrupting state.db with a malformed image. That is the trigger in #30636 ('SIGTERM during launchd shutdown under high load'). Apply PRAGMA checkpoint_fullfsync=1 (macOS-guarded) in apply_wal_with_fallback. It forces the F_FULLFSYNC barrier only at checkpoint boundaries (where WAL frames land in the main DB), so cost amortizes to ~+0.1ms/commit vs ~+4ms for the broader fullfsync=1. No-op off Darwin (F_FULLFSYNC is macOS-only). Root-cause analysis by @catapreta on #30636. Supersedes #30654, whose synchronous=FULL is a no-op (already FULL in WAL mode) and whose TRUNCATE-on-close is already on main. Co-authored-by: catapreta <catapreta@users.noreply.github.com>	2026-06-28 00:53:19 -07:00
Teknium	7c38249c79	feat(moa): references see full tool state + fire on every user/tool response (#54016) The advisory reference view stripped all tool calls and tool results, so reference models judged a task whose actions and results they never saw — and references only fired once per user turn, never re-running as the agent's state advanced through the tool loop. Two fixes: - _reference_messages() now PRESERVES the agent's tool calls and tool results, rendering them inline as text ([called tool: ...] / [tool result: ...]) so a reference gives an informed judgement on the real current state. Still emits zero tool-role messages and zero tool_calls arrays (strict providers reject those), and large tool results are previewed head+tail (4000-char budget). The required end-on-user shape is met by APPENDING a synthetic advisory user turn — not by deleting the agent's latest context (which the prior fix did). - References now re-run on every state change — each new user message AND each new tool result — instead of once per user turn. The state-sensitive advisory signature drives the cache: new tool result = miss (re-run), identical-state re-call = hit (no re-run, no re-emit). The acting aggregator still receives the full, untrimmed transcript.	2026-06-28 00:30:11 -07:00
kshitijk4poor	fc7a01b6cb	test+harden: modernize salvaged Matrix path for current plugin layout Two follow-ups on top of the salvaged #46365 fix: 1. Tests: the salvaged tests injected the ephemeral MatrixAdapter via sys.modules["gateway.platforms.matrix"], but Matrix migrated to a plugin (#41112) and the fallback now imports from plugins.platforms.matrix.adapter. Point the three sys.modules patches at the current module path so the ephemeral-fallback tests actually exercise the injected fake adapter. 2. Harden the live-adapter lookup: split the gateway import guard from the adapter lookup and log (instead of silently swallowing) when a runner exists but adapters.get() raises. A silent fall-through there would re-introduce the per-send reconnect/OTK-exhaustion storm this fix exists to prevent (#46310). Documented that the live adapter is gateway-owned and must not be disconnected, and why the ephemeral finally never touches it.	2026-06-28 12:48:08 +05:30
liuhao1024	a7fd62d824	fix(send_message): reuse live gateway adapter for Matrix media sends When a live gateway adapter is available (i.e. the tool runs inside a running gateway), reuse the persistent connection instead of creating a new MatrixAdapter per call. This eliminates per-message E2EE re-init storms that exhaust recipient OTKs and silently drop messages. The fix follows the same pattern as _send_to_platform (line 618): gateway_runner_ref → runner.adapters[Platform.MATRIX]. Falls back to the ephemeral connect/disconnect cycle for standalone contexts. Also extracts the shared send logic into _send_via_matrix_adapter() to avoid duplicating the media dispatch code between the two paths. Fixes #46310	2026-06-28 12:48:08 +05:30
Ben Barclay	1466eab4ee	test(docker): wait for cont-init to finish before privilege-drop shim tests (#54026) The docker-exec privilege-drop shim tests started a sleep container and released the fixture as soon as `docker exec <c> true` returned 0. On s6-overlay that succeeds almost immediately — ~0.05s in measurement — long before the `01-hermes-setup` cont-init hook (docker/stage2-hook.sh) has finished seeding + `chown hermes:hermes` config.yaml and running the Python config migration (cont-init only fully settles at ~9.8s under arm64 QEMU emulation). `test_shim_opt_out_keeps_root` wipes config.yaml, writes it as root with HERMES_DOCKER_EXEC_AS_ROOT=1, and asserts root:root ownership. When the fixture released the test inside that ~10s window, stage2-hook's boot-time `chown hermes:hermes config.yaml` raced the root-written file and reset it to hermes:hermes — failing the assertion. The window is invisible on native amd64 (stage2-hook completes in a blink) but wide open under the arm64 build's QEMU emulation, which is why only build-arm64 flaked while build-amd64 stayed green. Replace the responsiveness poll with a wait on the canonical 'cont-init finished' signal: $HERMES_HOME/logs/container-boot.log gaining a `profile=default` line, written by 02-reconcile-profiles which s6 runs strictly after 01-hermes-setup. Mirrors the readiness pattern already used in test_container_restart.py. Also bumps the readiness timeout 20s->60s to cover slow emulation. No production code change — test-only hardening of a timing race.	2026-06-28 17:06:26 +10:00
Teknium	4f61d48aef	test(cron): deterministically wait for ticker, fix wall-clock flake (#54010) tests/cron/test_scheduler_provider.py spawned a background ticker thread, slept a fixed 0.2s, then asserted the loop had called tick()/heartbeat() at least N times. Under loaded CI the worker thread isn't always scheduled within that window, so the loop hadn't ticked yet — flaking with 'provider never called tick()' (assert 0 >= 1). Add a _wait_until(predicate, timeout) helper and replace all five fixed time.sleep(0.2) sites with a poll on the actual predicate (calls/beats count reached). Same contract assertions, no wall-clock dependence.	2026-06-27 22:52:29 -07:00
Teknium	1fa44180b0	fix(moa): advisory references end on a user turn + get a reference-role system prompt (#54007) * fix(moa): reference advisory view must end with a user turn MoA reference calls failed with Anthropic models that don't support assistant prefill (e.g. Claude Opus 4.8): '400 ... must end with a user message'. The advisory view built by _reference_messages() kept the last assistant turn's text while dropping the following tool result, leaving a trailing assistant turn — which Anthropic (and OpenRouter->Anthropic) interpret as an assistant prefill to continue. References are advisory and must end on the user turn they answer. Strip trailing assistant turns from the advisory view (preserving intervening ones). Update the existing test that encoded the buggy shape and add a mid-tool-loop regression test. * feat(moa): give reference models an advisory-role system prompt Reference models received the bare trimmed conversation with no role framing, so they assumed they were the acting agent and refused ("I can't access repositories/URLs from here") or tried to call tools they don't have. Prepend a dedicated advisory system prompt to every reference call: the model is an analyst, not the actor — it cannot execute, should not apologize for lacking tools, and should reason about the presented state to advise the aggregator/orchestrator on approach, next steps, tool-use strategy, risks, and anything the acting agent missed. Its output is private guidance for the aggregator, not a user-facing answer.	2026-06-27 22:52:25 -07:00
Teknium	2523917680	fix(tests): bare pytest flags pass through run_tests.sh without a '--' separator (#54008) The parallel runner only forwarded pytest args after a literal '--', so a bare 'scripts/run_tests.sh tests/foo.py -q' (or -v/-x/-k/--tb=long) errored out with 'unrecognized arguments'. This contradicted the docstring's promise that common pytest flags pass through, and forced a retry on every run that used pytest muscle-memory. Now any token starting with '-' that isn't one of the runner's own options (-j/--jobs, --paths, --slice, --file-timeout, --generate-slices, --files, --include-integration) is routed to each per-file pytest invocation automatically. Value-taking flags given space-separated (-k expr, -m mark, -p plugin, -o name=val, etc.) keep their value instead of having it stolen by positional-path discovery. The explicit '--' separator still works and stacks with bare flags. - scripts/run_tests_parallel.py: argv splitter routes bare unknown flags to pytest; value-flag lookahead; updated docstring. - scripts/run_tests.sh: usage comment reflects bare-flag passthrough. - tests/test_run_tests_parallel.py: 4 behavior-contract tests (bare -q runs, -k keeps its value/filters, '--' still works, positional path stays a root).	2026-06-27 22:43:26 -07:00
teknium1	c918d42d88	feat(desktop): config-driven Electron launch flags + GPU policy Adds a desktop: section to config.yaml so headless/VM users can make `hermes desktop` launch correctly without a wrapper command: - desktop.electron_flags: extra Electron CLI flags (e.g. --ozone-platform=x11) appended to every launch. Accepts a list or a shell-split string. - desktop.disable_gpu: auto|true|false, bridged to the HERMES_DESKTOP_DISABLE_GPU env var the Electron app already reads. An explicit env var still wins. cmd_gui() reads these via _desktop_launch_options() and applies them. This is the config.yaml form of the capability proposed as a raw env var in #38934 (@1RB) — behavioral settings belong in config.yaml, not a new HERMES_* env var. Co-authored-by: ray <86501179+1RB@users.noreply.github.com>	2026-06-27 22:26:43 -07:00
Rafael Millan	54ea059919	fix: fall back to no-sandbox for desktop launch on restricted Linux hosts	2026-06-27 22:16:20 -07:00
infinitycrew39	1fa46570fb	test(agent,gateway): cover partial-stream recovery and restart helper salvage	2026-06-27 22:03:14 -07:00
Teknium	4626ceb747	fix(gateway): only offer system-scope gateway install to root sessions (#53975) Non-root users picking 'System service' in the setup wizard were handed a 'sudo hermes gateway install --system --run-as-user <you>' recipe that fails on most distros: sudo's secure_path strips ~/.local/bin (pipx/uv installs), so 'sudo hermes' is command-not-found. Worse, it funnels a non-root user toward a system install they shouldn't be doing from a user session. Now prompt_linux_gateway_install_scope() only offers system scope when os.geteuid()==0. Non-root sessions get user-service or skip, with a tip to re-run as root for a boot service. The non-root branch in install_linux_gateway_from_setup becomes a defensive guard that refuses without printing any self-elevation recipe. Gated the matching deferral hint in setup.py behind root too.	2026-06-27 21:24:08 -07:00
Priyanshu Sharma	f6deabca0d	fix(gateway): clear stale base_url on model switches	2026-06-27 21:23:25 -07:00
teknium1	f54c52800a	fix(models): scope live-first picker merge to opencode aggregators only Follow-up to the salvaged #49129 commit. The original change flipped the shared generic-provider merge in provider_model_ids() to live-first unconditionally, which regressed curated-first for single providers (kimi/zai, #46309) — and the PR encoded that regression by flipping the kimi-coding and zai test assertions to expect live-first. Gate live-first on an explicit _LIVE_FIRST_PICKER_PROVIDERS set ({opencode-zen, opencode-go}); every other provider keeps curated-first. Also widen the uncapped picker + live-first sets to opencode-go, which has the same 70+ model catalog problem as opencode-zen. Restore the kimi-coding curated-first test and rewrite the merge-order test to assert the per-provider contract.	2026-06-27 21:23:25 -07:00
Afnath Ahamed	f98ffbc246	fix(models): live-first merge + update opencode-zen catalog + uncap aggregator picker	2026-06-27 21:23:25 -07:00
HexLab98	04ff4d9b54	test(auxiliary): cover env-only proxy policy for auxiliary clients (#53702)	2026-06-27 21:22:49 -07:00
Teknium	3b23a984b5	feat(kanban): stamp handoff freshness so workers don't read stale state as current (#53973) Multi-agent boards leak staleness: a sibling worker's parent handoff, comment, or prior-attempt summary gets read by the next worker as live truth even when it's a day old. build_worker_context surfaced the text with (at best) a bare absolute timestamp, which an LLM reads as fact regardless of age — parent results had no timestamp at all. Adds a coarse relative-age stamp (just now / 18h ago / 3d ago) to every recalled-state line and a one-line 'point-in-time snapshot, re-verify against source' frame on the parent-results section, so the worker sees when handoffs were produced and re-checks stale ones before acting.	2026-06-27 21:21:54 -07:00
Teknium	131c9c542c	test(tui-gateway): stop deferred-resume build thread leaking into next test test_session_resume_uses_parent_lineage_for_display resumes via the deferred (non-eager) path, which fires a 50ms background Timer (_schedule_agent_build) calling whatever server._make_agent is patched in at that moment. The timer outlived the test and landed in the next test's (_follows_compression_tip) _make_agent mock, racily setting agent_session_id='tip' and flaking 'assert tip == cont_tip' on CI. Root-cause fix: stub _schedule_agent_build to a no-op in the leaking test (it only asserts display history). Defense in depth: the victim's fake_make_agent now setdefault()s so a stray late build can't overwrite the synchronous eager build's captured id.	2026-06-27 21:07:53 -07:00
Teknium	e418605450	test(24996): freeze monotonic clock to de-flake fallback cooldown timing The exhaustion-cooldown timing assertions relied on a wall-clock budget (before + window + 1.0s). On loaded CI runners the activation calls could exceed the 1s slack, flaking 'Run tests slice 4/8'. Freeze chat_completion_helpers.time.monotonic so the cooldown math is exact and load-independent across all four tests.	2026-06-27 21:07:53 -07:00
zccyman	db11849c9d	fix(skills): skip shadowing when external_dirs provides the skill Fixes #28126. sync_skills() was unconditionally writing bundled skills into the local <profile_home>/skills/ tree even when the profile's config.yaml delegated skill resolution to an external directory via skills.external_dirs. The skill loader then saw two candidates for the same name (local shadow + external canonical), refused to resolve on collision, and every worker that auto-loaded such a skill crashed with 'Unknown skill(s): <name>'. Changes: - _build_external_skill_index() indexes skills available in external dirs (by directory name and frontmatter name) - sync_skills() skips writing a bundled skill when it finds the same name in the external index; records the hash in the manifest so subsequent syncs treat it as already handled - Self-healing: removes stale local shadows left by prior buggy syncs (only when origin_hash == bundled_hash == user_hash, i.e. we wrote it and user didn't touch it) - New 'shadowed_by_external' key in sync_skills() return dict 3 new tests in TestExternalDirsIndexing (all passing). All 48 tests in test_skills_sync.py pass. Closes #28126	2026-06-27 21:07:53 -07:00
Teknium	a8c862900b	fix(tui): sanitize replay history on WebUI/TUI session resume (#29086) (#53939) A WebUI/TUI session whose last turn died mid-tool-loop (stale-timeout kill, interrupt, or process restart before the tool result was written) persists a dangling assistant(tool_calls) or interrupted assistant->tool tail. The messaging gateway already strips these tails before replay (the #49201 fix), but the TUI/WebUI resume path fed db.get_messages_as_conversation() straight in as the agent's conversation_history with no cleanup. The model re-issued the unanswered call on every resume -- including after a full WebUI + Gateway restart, since the poison lives in the SessionDB, not memory -- leaving the session permanently 'thinking'. Only deleting the session recovered it. - Extract the two strippers + helper from gateway/run.py into a shared agent/replay_cleanup.py (sanitize_replay_history wraps both). - gateway/run.py re-exports under the historical private names; messaging behavior unchanged. - Both TUI cold-resume sites now sanitize the model-fed history while leaving the display transcript untouched, so the user still sees their full history. Verified E2E against a real SessionDB: dangling and interrupted tails are stripped from the model feed, healthy mid-progress tool sequences are preserved, and the display transcript is always the full raw history.	2026-06-27 20:56:49 -07:00
Teknium	f03823014b	fix(telegram): kill 409 polling conflict loop by disarming PTB retry synchronously (#53941) Telegram polling entered a self-inflicted ~31s loop of 409 Conflict -> retry -> resume -> Conflict. The error_callback PTB invokes synchronously inside its internal network_retry_loop only scheduled our async recovery task (loop.create_task) and returned, so PTB kept polling getUpdates on its own while our handler concurrently ran stop -> sleep -> start_polling. The two polling sessions overlapped and Telegram returned a fresh 409. Fix: in the conflict branch of the error_callback, synchronously set PTB's private polling stop_event before scheduling recovery. PTB's loop exits on its next tick (it races that event in do_action), so our handler owns polling alone. The handler's await updater.stop() drains the task and PTB clears the event, so the subsequent start_polling() builds a fresh event and is not poisoned. Keeps the existing reconnect ladder intact (option B) — fixes only the race. Defensive: probes mangled + unmangled stop_event spellings and no-ops (prior behaviour) if neither exists; never flips _running, which would make the handler skip stop() and leave the loop wedged.	2026-06-27 20:46:08 -07:00
Teknium	d43e0cf304	fix(agent): config-driven intent-ack continuation for all api_modes (#27881) (#53943) * fix(agent): config-driven intent-ack continuation for all api_modes (#27881) The agent could end a turn after only stating intent ('I will run a health check...') without executing the announced tool call, forcing the user to re-prompt. A continuation guard that catches this and nudges the model to proceed already existed but was hard-gated to the codex_responses api_mode, so Gemini/Claude/OpenRouter turns never benefited. - New agent.intent_ack_continuation config (default 'auto' = codex-only, byte-stable for existing conversations). 'true'/model-list opts every api_mode in; 'false' disables. Mirrors agent.tool_use_enforcement's shape. - looks_like_codex_intermediate_ack gains require_workspace (default True). The opted-in path drops the codebase/filesystem requirement so general autonomous workflows (server ops, deploys, API calls) are caught, not just coding tasks. Future-ack + action-verb + short-content + no-prior-tool guards still apply; the 2-nudge-per-turn cap is unchanged. - Resolution centralized in intent_ack_continuation_mode (off/codex_only/all). * docs(infographic): intent-ack continuation (#27881)	2026-06-27 20:46:00 -07:00
Teknium	56abbaeac3	fix(curator): fail closed on unverified skill deletes during consolidation (#53935) The curator's LLM consolidation pass could archive whole clusters of active skills with zero verified consolidations (#29912): a bare prune (skill_manage delete with absorbed_into empty/omitted) from the forked review agent was accepted, removing the skill's name from lookup even though counts.consolidated_this_run was 0. - _delete_skill now fails closed during the curator/background-review pass: a delete is only allowed when it declares a verified consolidation (absorbed_into=<umbrella>, umbrella must exist). A prune with no forwarding target is refused; the skill stays active. The deterministic inactivity prune (archive_skill) is unaffected. - A verified consolidation delete during the curator pass now routes through the recoverable archive primitive instead of shutil.rmtree, so a misjudged consolidation can be undone with hermes curator restore. The usage record is kept (state=archived) rather than forgotten. - Foreground, user-directed deletes keep their existing hard-delete semantics.	2026-06-27 20:45:57 -07:00
konsisumer	11b0be8d15	fix(gateway): avoid Matrix pending invite boot loops	2026-06-27 20:45:51 -07:00
teknium1	a1ac6baac4	fix(gateway): make bg-process reset TTL configurable + surface session-scoped processes Follow-up to the cherry-picked #29212 (#29177): - Promote the 24h stale-process threshold to config.yaml (session_reset.bg_process_max_age_hours) instead of a hardcoded constant. 0 disables the cutoff (legacy: any live process blocks reset). Wired through GatewayConfig.default_reset_policy in gateway/run.py. - Bug 2: process(action=list) now resolves the gateway session_key from the contextvar and surfaces session-scoped background processes (a forgotten preview server under a different task), flagged session_scoped — so the agent/user can discover and kill the blocker. Previously the task-scoped list returned [] and the blocker was invisible. - Tests: config round-trip for the new field, cross-task list visibility. - Docs: messaging session-reset section.	2026-06-27 20:45:43 -07:00
annguyenNous	33d8b66d5b	fix: stale background processes no longer permanently block session reset Background processes (e.g. http.server preview) that Hermes starts and forgets about previously blocked session idle/daily reset indefinitely. The reset guard in session.py checked has_active_for_session() with no max age — a 3-day-old preview server blocked reset the same as a task started 30 seconds ago. Changes: - Add max_active_age parameter to has_active_for_session() in process_registry.py. Processes older than this threshold are ignored. - Add MAX_ACTIVE_PROCESS_AGE constant (24h / 86400s). - Wire max_active_age into the gateway's session store callback in run.py so stale processes no longer block session lifecycle. - Add debug logging when reset is skipped due to active processes. - Add 3 tests covering recent, stale, and legacy (None) max age. Fixes #29177	2026-06-27 20:45:43 -07:00
teknium1	9c6229ce24	fix(security): centralize credential-safe subprocess env (#29157) Subprocesses spawned outside the terminal/execute_code path (agent-browser, copilot ACP, dep-ensure, lazy_deps uv install, TUI Node host, cli.exec) inherited the operator's full credential environment via os.environ.copy(). The terminal path was already scrubbed by _HERMES_PROVIDER_ENV_BLOCKLIST (#1002/#1264/#32314); these spawn sites bypassed it. Adds hermes_subprocess_env(inherit_credentials=) in tools/environments/local.py reusing the existing dynamic blocklist as the single source of truth: - Tier 1 (_ALWAYS_STRIP_KEYS): gateway bot tokens, GitHub auth, infra secrets -- stripped even for credential-inheriting children. - Tier 2 (_HERMES_PROVIDER_ENV_BLOCKLIST): provider/tool keys -- stripped unless inherit_credentials=True. The opt-in is grep-able for audit. Browser worker keeps a _BROWSER_PASSTHROUGH_KEYS allowlist (BROWSERBASE/FIRECRAWL) re-added after the strip. Model-driving children (ACP, TUI Node host, cli.exec) use inherit_credentials=True so they still get provider keys while losing Tier-1 secrets. Installers (dep-ensure, lazy_deps) inherit nothing sensitive. cua_backend already routed through _sanitize_subprocess_env on main -- left as-is. Gateway adapter utility spawns (gh pr comment, ffmpeg) are left inheriting env: gh needs GH_TOKEN by design, ffmpeg is a trusted system binary -- no untrusted-dependency exposure. This is defense-in-depth (personal-assistant trust model: same-user spawns), making the existing scrub policy uniform across the spawn surface; the main real payoff is shrinking the blast radius if a transitive npm dep in agent-browser is compromised. Reconstructed on current main from the design in #31959 (Tranquil-Flow); also credits #39003 (rodboev), #37843 (coygeek), #35769 (egilewski). Co-authored-by: Tranquil-Flow <tranquil_flow@protonmail.com> Co-authored-by: rodboev <rod.boev@gmail.com> Co-authored-by: egilewski <egilewski@egilewski.com>	2026-06-27 20:45:31 -07:00
Hermes Agent	88b3d8638e	test: de-flake SIGKILL-tree, compression-tip resume, and fallback-cooldown tests Three CI flakes hit while landing the credential-pool restore fix; all three were timing/wall-clock races in the tests, not product bugs (each passes locally and the assertions are correct): - test_entire_tree_is_sigkilled_not_just_parent: _terminate_host_pid SIGKILLs synchronously, but the test's 4s budget after a 1s in-function SIGTERM grace left almost no slack for the kernel to tear down 3 processes + reparent the children to zombies under loaded-CI scheduling. Widen the wait to 15s and make the liveness predicate tolerant of vanished-pid / zombie races. The assertion never weakens: every tree member must end up dead or zombie. - test_session_resume_follows_compression_tip: appended messages got time.time() timestamps (~now) while the test forced session started_at into the past, so the get_compression_tip MAX(m.timestamp) tiebreaker depended on wall-clock ordering. Pass explicit, well-separated message timestamps so the chain resolution is deterministic by construction. - test_non_retryable_exhaustion_arms_cooldown: asserted the short (5s) exhaustion cooldown with a tight +1.0s slack, which false-fails when wall-clock jitter between the 'before' snapshot and the cooldown computation exceeds a second on a loaded runner. Widen to +30s — still cleanly below the 60s rate-limit window it must distinguish from.	2026-06-27 20:04:45 -07:00
Jack Maloney	f0de4c6a47	fix(pool): re-select from credential pool on primary runtime restore _restore_primary_runtime restored the construction-time api_key snapshot and never consulted the credential pool. After the pool rotated away from a revoked/exhausted entry mid-session, every new turn restored the dead key, re-failed instantly, burned the remaining entries, and fell through to cross-provider fallback. After restoring the snapshot, re-select the pool's current best entry and swap the live credential in via _swap_credential (which already rebuilds the OpenAI/Anthropic client, reapplies base-url headers, and carries the #33163 base_url / OAuth-detection fixes). Falls back to the snapshot key when the pool is absent, empty, or the entry has no usable key. Salvaged from #25206 onto current main: the original targeted the pre-refactor monolithic method in run_agent.py; the logic now lives in agent/agent_runtime_helpers.py and is collapsed onto _swap_credential instead of re-inlining the client rebuild. Fixes #25205	2026-06-27 20:04:45 -07:00
kshitijk4poor	2af1678bfc	fix(auth): explicit provider intent beats stale OAuth active_provider (#29285) `resolve_provider("auto")` checked `auth.json` `active_provider` BEFORE the config.yaml `model.provider` and env-var API-key checks. So a user who was OAuth-logged-into one provider (e.g. Anthropic) but had set an explicit `model.provider` or exported an API key (e.g. `OPENAI_API_KEY`) was silently routed to the stale OAuth provider — the override was invisible and surprising. Reorder the auto-path so explicit intent wins (the order the issue asks for): 1. explicit CLI api_key/base_url 2. config.yaml `model.provider` (safety net — see below) 3. OPENAI_API_KEY / OPENROUTER_API_KEY env 4. OpenRouter credential pool 5. provider-specific API-key env vars 6. auth.json `active_provider` (OAuth) ← demoted to last-resort 7. AWS Bedrock credential chain 8. error `active_provider` is still honored — it's just a last-resort fallback chosen only when the user expressed no other preference, instead of overriding one. The normal chat/gateway/TUI/ACP/status path already resolves config.provider upstream in `resolve_requested_provider()` before "auto" is reached, so this duplicate config check is the safety net for the lone direct caller (`main.py` `resolve_provider("auto")`) and any future bypass. Because every surface funnels through this one resolver, the fix propagates everywhere with a single edit — no sibling path re-implements precedence. Also add a one-shot WARN when resolution lands on `active_provider` while a populated `model` config dict lacks a `provider` key — surfacing the silent override the issue reported without breaking first-install. Synthesizes the two competing PRs: #29615 (LifeJiggy — config-before-auth + the silent-override framing) and #29809 (Minksgo — the env-before-auth reorder). #29809 could not be merged directly (bundled unrelated, un-opt-in cost-tagging telemetry); its reorder idea is incorporated here and credited. Tests: tests/hermes_cli/test_provider_precedence.py — config/env beat stale OAuth, OAuth still used as last resort, explicit request short-circuits, WARN fires on silent fall-through. Full provider-resolution suites: 374 passed. Fixes #29285 Co-authored-by: LifeJiggy <141562589+LifeJiggy@users.noreply.github.com> Co-authored-by: Minksgo <153416856+Minksgo@users.noreply.github.com>	2026-06-27 19:49:02 -07:00
teknium1	2b73dd1ca6	fix(gateway): namespace --replace takeover marker by HERMES_HOME to stop cross-profile flap (#29092) Two profile gateway services sharing the default ~/.hermes resolve the takeover marker to the same path. A --replace from profile B could land in profile A's marker, match on PID + start_time by coincidence of a shared PID namespace, and make profile A exit 0 — only to be revived by systemd Restart=always, which races the replacer again, flapping indefinitely. write_takeover_marker now stamps replacer_hermes_home; the shared consume path rejects markers written under a different HERMES_HOME and leaves them in place for the correct profile. Absent field (older markers) is treated as same-home, so single-profile and mixed old/new deployments are unaffected. Salvaged from #31414 by @CryptoByz onto current main (branch was ~3962 commits behind; the consume function had since been refactored for issue #34597). Co-authored-by: CryptoByz.	2026-06-27 19:43:02 -07:00
郝鹏宇	98488c4be4	fix(config): prevent save_config from materialising schema defaults Fixes #27354 Root cause: called during init (or by any code path that saves ) wrote injected schema defaults into config.yaml as if the user had authored them. Two fix layers: 1. now only injects when the user actually set somewhere (root or agent). A user who never set keeps it absent, so 's explicit-path detection won't treat it as user-authored. 2. gains a parameter and a new pass that removes keys matching unless those paths were explicitly present in the raw (pre-normalization) config on disk. Explicit-path detection uses on before any normalisation runs — preventing injected-in defaults from being mistaken for user-set values. All migration and edit-config call sites pass to preserve their intentional default-seeding behaviour. New helpers: - — collects leaf-key paths from a raw dict - — removes keys matching schema defaults Test coverage: 4 new regression tests (59 total, all passing).	2026-06-27 19:38:11 -07:00
teknium1	6dcc579bcb	test(streaming): repoint anthropic stream-cleanup test to close+rebuild path The existing test_anthropic_stream_parser_valueerror_retries_before_delivery asserted mock_replace.call_count == 1 — i.e. it passed precisely because the buggy OpenAI rebuild was invoked on the Anthropic path. Repoint it to assert the corrected close+rebuild-Anthropic behavior (#28161).	2026-06-27 19:37:33 -07:00
EloquentBrush0x	a0b9663c7c	fix(streaming): rebuild Anthropic client on stream cleanup instead of OpenAI client interruptible_streaming_api_call() has three connection-pool cleanup sites that called _replace_primary_openai_client() unconditionally. For api_mode=anthropic_messages this has two consequences: 1. _replace_primary_openai_client() fails (OPENAI_API_KEY unset on Anthropic-only configs), so dead connections are never purged. 2. The stale-stream detector's outer-poll site (L1977) is the only mechanism that can interrupt the worker thread while it blocks in for event in stream:. Because the Anthropic client is never closed, the thread stays blocked until the 900 s httpx read-timeout fires, producing a visible 15-minute hang for Telegram/gateway users on claude-opus-4-7. Fix: mirror the existing interrupt-path pattern (L1989-1997) at all three cleanup sites — if api_mode == "anthropic_messages", call _anthropic_client.close() + _rebuild_anthropic_client() instead of _replace_primary_openai_client(). _rebuild_anthropic_client() handles both direct Anthropic and Bedrock-hosted Claude correctly, unlike the inline build_anthropic_client() calls in open PR #14430. PR #14430 (open) covers only the outer stale-detector site (L1977). PR #23678 (open) covers only the inner retry sites (L1774, L1833). This PR covers all three sites and uses _rebuild_anthropic_client() for Bedrock parity. Fixes #28161	2026-06-27 19:37:33 -07:00
xxxigm	6f1a176b33	fix(gateway/discord): REST liveness probe to detect zombie clients (#26656) The Discord adapter could enter a silent zombie state after a network outage / proxy stall: the process is alive, _client looks open, but the underlying socket is dead. discord.py's WebSocket reconnect never sees a RST through a wedged proxy/NAT, so client.start() spins forever without exiting — which means the bot-task done callback (which only fires on task completion) never trips either. The bot stays "offline" in Discord until a manual `hermes gateway restart`. Reported offline for 13-17h. Adds an out-of-band REST liveness probe in DiscordAdapter. Every `discord.liveness_interval_seconds` (default 60s) the adapter issues a cheap fetch_user(bot_id) — the same REST path as message delivery, so it fails when the proxy/NAT is wedged. After `discord.liveness_failure_threshold` consecutive failures (default 3) the probe closes the wedged client and surfaces a retryable fatal error, which trips the gateway's existing _platform_reconnect_watcher and rebuilds the adapter. Operators disable it by setting either knob to 0. Config lives in config.yaml (discord.liveness_) per the .env-is-secrets policy; _apply_yaml_config bridges it to internal env vars the adapter reads, matching the existing HERMES_DISCORD_TEXT_BATCH_ pattern. Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-27 19:30:32 -07:00
teknium1	457c8a0a7c	fix(file-ops): keep worktree isolation when restoring preserved cwd (#26211) The durable _last_known_cwd anchor is keyed by the shared 'default' container, so a non-owning worktree session could inherit the owning session's cwd through it — breaking the wrong-worktree-routing fix (test_file_tools_cwd_resolution::test_resolution_routes_to_resolving_sessions_worktree). Reorder _authoritative_workspace_root so the session-specific registered cwd override (keyed by raw session id) is checked BEFORE the shared-container _last_known_cwd fallback. A non-owning session now resolves into its own registered worktree; the durable anchor only fills in when there's no session-specific override (the #26211 single-session case). Adds a regression test covering the owner-mirrors-then-other-session-resolves interaction.	2026-06-27 19:29:06 -07:00
teknium1	b2faeba182	fix(file-ops): make preserved cwd reachable at write-time resolution (#26211) Belt-and-suspenders on top of the cherry-picked cwd-preservation fix: - Proactively mirror every live terminal cwd into _last_known_cwd on each successful read, so the durable anchor survives even when the cleanup thread pops both _file_ops_cache and _active_environments before _get_file_ops' stale-cache save branch can fire. - Fall back to _last_known_cwd in _authoritative_workspace_root. write_file_tool resolves the path (via _resolve_path_for_task) BEFORE _get_file_ops rebuilds the env, so restoring only the rebuilt env's cwd was insufficient — the resolution that decides where the file lands runs first. This closes that gap. The local env's persisted _cwd_file can't serve this role: it's keyed by a random per-session uuid and deleted on cleanup (the same cleanup that triggers the bug). The in-memory _last_known_cwd registry is the durable anchor instead. Adds a real-IO E2E regression (TestSilentFileMisplacementE2E) exercising the actual write_file_tool path after env cleanup.	2026-06-27 19:29:06 -07:00
zccyman	adeba1d7a8	fix(file-ops): preserve CWD across terminal environment re-creation (#26211) Root cause: when the terminal environment (`_active_environments` entry) is cleaned up and re-created during a long conversation, the new environment always starts with the default config CWD (typically `~/.hermes/hermes-agent`) instead of preserving the user's last-known working directory. Subsequent relative-path writes (`write_file`, `execute_code`, shell commands) silently land in the default CWD, making files appear to be "created but absent." Fix: add `_last_known_cwd` dict that preserves the old environment's CWD before the stale cache entry is invalidated. When a new environment is created for the same task_id, we check `_last_known_cwd` first and use the preserved CWD instead of the config default. Changes: - tools/file_tools.py: add `_last_known_cwd` dict, save CWD before stale cache invalidation, restore CWD on env recreation - tests/tools/test_file_tools.py: add `TestLastKnownCwd` with 2 tests verifying CWD preservation and fallback behavior Fixes #26211	2026-06-27 19:29:06 -07:00
teknium1	926a1b915d	fix(tools): suppress transient check_fn flakes so subagents keep file/terminal tools A flaky external probe in a tool's check_fn (e.g. check_terminal_requirements running `docker version` with a 5s timeout, momentarily timing out under load) would return False for a single get_tool_definitions() call. Because file tools delegate their check_fn to the terminal check, that one flake silently stripped read_file/write_file/patch/search_files AND terminal from whatever agent was being constructed at that instant — most visibly a delegate_task subagent, which then reported "Tool read_file does not exist". This explains both the intermittent (~80% success) user-session failures and the deterministic cron failures in #21658 / #5304. The existing _check_fn TTL cache made this worse: it cached the transient False for the full 30s window, poisoning every subagent spawned in that span. Fix: remember the last time each check_fn returned True; when a fresh probe fails within a short grace window of that success, treat it as a flake — serve the last-good True and do NOT cache the failure (so the next call re-probes). A failure with no recent success, or past the grace window, is honored normally so a backend that genuinely went down stops advertising its tools. Probe failures now log at WARNING regardless of quiet mode, making the previously-silent tool loss diagnosable in subagent (quiet) sessions. Co-authored-by: Stuart Horner <5261694+djstunami@users.noreply.github.com>	2026-06-27 19:29:00 -07:00
Shashwat Gokhe	505bc27d8d	fix(gateway): classify mixed attachments per-attachment + transcode uncommon image formats A document attached alongside an image in the same Discord message was swept into the vision pipeline and 400'd the whole turn ("Could not process image"), and was simultaneously never surfaced to the agent as a readable file. Restores the "any file type works" contract for mixed messages and fixes the HTTP 400. Bug 1 — mixed attachments: the inbound routing loop keyed image/audio/video classification off the message-level type (PHOTO/VOICE/AUDIO), so a doc in a PHOTO message landed in image_paths and poisoned the vision call. The document context-note path was gated on message_type == DOCUMENT, so that same doc never reached the agent at all. Now classification is per-attachment (trust each attachment's own MIME; fall back to the message-level type only when MIME is unknown), via shared _event_media_is_* helpers used by both _build_media_placeholder and the main inbound loop. The document note now fires for any non-image/audio/video attachment regardless of message-level type. Bug 2 — uncommon formats: AVIF/HEIC/BMP/TIFF/ICO produced the same generic 400 because providers only accept PNG/JPEG/GIF/WEBP. image_routing now transcodes those to PNG via Pillow before declaring media_type, skipping cleanly (logged) if Pillow/plugins are missing. SVG is vector — Pillow can't rasterize it — so it's skipped rather than transcoded. Closes #25935. Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com> Co-authored-by: cypres0099 <74935762+cypres0099@users.noreply.github.com>	2026-06-27 19:26:04 -07:00
teknium1	0c372274cd	fix(agent): disable OpenAI SDK auto-retry that double-fires inside the rate-limit loop Same bug class as the Anthropic fix (#26293): the OpenAI/aggregator client is built without max_retries, so the SDK default of 2 applies. The SDK's own 1-2s backoff ignores Retry-After and retries inside hermes's outer conversation loop, burning request slots against a rate-limited bucket. Set max_retries=0 at the single create_openai_client chokepoint (covers init, switch_model, recovery, restore, request-scoped). auxiliary_client builds its own clients and is not wrapped by the loop, so it keeps SDK retries.	2026-06-27 19:23:15 -07:00
konsisumer	1ab35ba25d	fix(anthropic): stop SDK auto-retry double-firing and raise Retry-After cap to 600s The Anthropic SDK clients were built without max_retries, so the SDK default (max_retries=2) retried 429/5xx with its own backoff that ignores Retry-After — double-retrying inside hermes's outer loop and burning request slots against a bucket that won't refill for minutes. Set max_retries=0 on all Anthropic/AnthropicBedrock client constructions so the outer conversation loop (which already honors Retry-After) owns retry. Also raise the Retry-After cap in the conversation loop from 120s to 600s. Anthropic Tier 1 input-token buckets reset in ~171s, so the 120s cap made hermes retry before the reset window and re-trip the limit. Refs #26293	2026-06-27 19:23:15 -07:00
LeonSGP43	32732a8f83	fix(agent): cap same-entry credential refreshes so fallback can activate (#26080) A persistent upstream 401 on a single-entry OAuth pool (common for Claude Max subscribers) made the credential-pool recovery spin forever: try_refresh_current() re-mints a fresh token and reports success on every 401, so recover_with_credential_pool returned True and the retry loop continue'd without ever incrementing retry_count or reaching the auth-failover block. The configured fallback_model never activated and the agent appeared to hang. Cap consecutive successful same-entry refreshes (keyed by provider + pool-entry id) at 2; once exceeded, treat the credential as unrecoverable and return not-recovered so the loop falls through to _try_activate_fallback. The 429/billing paths already rotate-or-fall-through correctly (mark_exhausted_and_rotate returns None on a single entry), so only the auth-refresh branch needed the cap. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-06-27 19:20:07 -07:00
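
The refresh cap described in commit 32732a8f83 can be illustrated with a minimal sketch. This is not the repository's code: `recover_on_auth_error`, `note_request_succeeded`, and the `counts` dict are hypothetical names chosen for illustration, and resetting the counter after a successful request is an assumption about what "consecutive" means here. Only the cap value of 2 and the (provider, pool-entry id) key come from the commit message.

```python
MAX_SAME_ENTRY_REFRESHES = 2  # cap stated in the commit message


def recover_on_auth_error(counts, provider, entry_id, refresh):
    """Return True to retry the request, False to fall through to the fallback.

    counts:  dict mapping (provider, entry_id) -> consecutive successful refreshes
             (hypothetical bookkeeping, not the project's data structure).
    refresh: zero-argument callable returning True when a new token was minted.
    """
    key = (provider, entry_id)
    # Cap reached: treat the credential as unrecoverable without refreshing again.
    if counts.get(key, 0) >= MAX_SAME_ENTRY_REFRESHES:
        return False
    # A failed refresh is also "not recovered".
    if not refresh():
        return False
    counts[key] = counts.get(key, 0) + 1
    return True


def note_request_succeeded(counts, provider, entry_id):
    """Assumed reset point: a successful request ends the consecutive streak."""
    counts.pop((provider, entry_id), None)


if __name__ == "__main__":
    counts = {}
    always_ok = lambda: True  # simulates a refresh that always "succeeds"
    outcomes = [
        recover_on_auth_error(counts, "anthropic", "entry-1", always_ok)
        for _ in range(3)
    ]
    print(outcomes)  # [True, True, False]
```

With a refresh that always reports success, the first two 401s are retried and the third returns not-recovered, which is the point at which a caller could activate its fallback model instead of looping.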

6446 commits