hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Priyanshu Sharma	f6deabca0d	fix(gateway): clear stale base_url on model switches	2026-06-27 21:23:25 -07:00
teknium1	f54c52800a	fix(models): scope live-first picker merge to opencode aggregators only Follow-up to the salvaged #49129 commit. The original change flipped the shared generic-provider merge in provider_model_ids() to live-first unconditionally, which regressed curated-first for single providers (kimi/zai, #46309) — and the PR encoded that regression by flipping the kimi-coding and zai test assertions to expect live-first. Gate live-first on an explicit _LIVE_FIRST_PICKER_PROVIDERS set ({opencode-zen, opencode-go}); every other provider keeps curated-first. Also widen the uncapped picker + live-first sets to opencode-go, which has the same 70+ model catalog problem as opencode-zen. Restore the kimi-coding curated-first test and rewrite the merge-order test to assert the per-provider contract.	2026-06-27 21:23:25 -07:00
Afnath Ahamed	f98ffbc246	fix(models): live-first merge + update opencode-zen catalog + uncap aggregator picker	2026-06-27 21:23:25 -07:00
teknium1	2e7e600eaa	chore(release): map HexLab98 author for PR #53863 salvage	2026-06-27 21:22:49 -07:00
HexLab98	04ff4d9b54	test(auxiliary): cover env-only proxy policy for auxiliary clients (#53702)	2026-06-27 21:22:49 -07:00
HexLab98	073847c0f2	fix(auxiliary): use env-only proxy policy for OpenAI SDK clients (#53702) Auxiliary clients now inject a keepalive httpx transport with explicit HTTPS_PROXY/NO_PROXY resolution, matching the main agent. This avoids macOS system proxy settings (which omit the ExceptionsList) breaking vision and other auxiliary calls to internal provider endpoints.	2026-06-27 21:22:49 -07:00
Teknium	3b23a984b5	feat(kanban): stamp handoff freshness so workers don't read stale state as current (#53973) Multi-agent boards leak staleness: a sibling worker's parent handoff, comment, or prior-attempt summary gets read by the next worker as live truth even when it's a day old. build_worker_context surfaced the text with (at best) a bare absolute timestamp, which an LLM reads as fact regardless of age; parent results had no timestamp at all. Adds a coarse relative-age stamp (just now / 18h ago / 3d ago) to every recalled-state line and a one-line 'point-in-time snapshot, re-verify against source' frame on the parent-results section, so the worker sees when handoffs were produced and re-checks stale ones before acting.	2026-06-27 21:21:54 -07:00
Teknium	131c9c542c	test(tui-gateway): stop deferred-resume build thread leaking into next test test_session_resume_uses_parent_lineage_for_display resumes via the deferred (non-eager) path, which fires a 50ms background Timer (_schedule_agent_build) calling whatever server._make_agent is patched in at that moment. The timer outlived the test and landed in the next test's (_follows_compression_tip) _make_agent mock, racily setting agent_session_id='tip' and flaking 'assert tip == cont_tip' on CI. Root-cause fix: stub _schedule_agent_build to a no-op in the leaking test (it only asserts display history). Defense in depth: the victim's fake_make_agent now setdefault()s so a stray late build can't overwrite the synchronous eager build's captured id.	2026-06-27 21:07:53 -07:00
Teknium	e418605450	test(24996): freeze monotonic clock to de-flake fallback cooldown timing The exhaustion-cooldown timing assertions relied on a wall-clock budget (before + window + 1.0s). On loaded CI runners the activation calls could exceed the 1s slack, flaking 'Run tests slice 4/8'. Freeze chat_completion_helpers.time.monotonic so the cooldown math is exact and load-independent across all four tests.	2026-06-27 21:07:53 -07:00
teknium1	1ad8b44413	docs(infographic): skill sync external_dirs shadow fix	2026-06-27 21:07:53 -07:00
zccyman	db11849c9d	fix(skills): skip shadowing when external_dirs provides the skill Fixes #28126. sync_skills() was unconditionally writing bundled skills into the local <profile_home>/skills/ tree even when the profile's config.yaml delegated skill resolution to an external directory via skills.external_dirs. The skill loader then saw two candidates for the same name (local shadow + external canonical), refused to resolve on collision, and every worker that auto-loaded such a skill crashed with 'Unknown skill(s): <name>'. Changes: - _build_external_skill_index() indexes skills available in external dirs (by directory name and frontmatter name) - sync_skills() skips writing a bundled skill when it finds the same name in the external index; records the hash in the manifest so subsequent syncs treat it as already handled - Self-healing: removes stale local shadows left by prior buggy syncs (only when origin_hash == bundled_hash == user_hash, i.e. we wrote it and user didn't touch it) - New 'shadowed_by_external' key in sync_skills() return dict 3 new tests in TestExternalDirsIndexing (all passing). All 48 tests in test_skills_sync.py pass. Closes #28126	2026-06-27 21:07:53 -07:00
Teknium	a8c862900b	fix(tui): sanitize replay history on WebUI/TUI session resume (#29086) (#53939) A WebUI/TUI session whose last turn died mid-tool-loop (stale-timeout kill, interrupt, or process restart before the tool result was written) persists a dangling assistant(tool_calls) or interrupted assistant->tool tail. The messaging gateway already strips these tails before replay (the #49201 fix), but the TUI/WebUI resume path fed db.get_messages_as_conversation() straight in as the agent's conversation_history with no cleanup. The model re-issued the unanswered call on every resume -- including after a full WebUI + Gateway restart, since the poison lives in the SessionDB, not memory -- leaving the session permanently 'thinking'. Only deleting the session recovered it. - Extract the two strippers + helper from gateway/run.py into a shared agent/replay_cleanup.py (sanitize_replay_history wraps both). - gateway/run.py re-exports under the historical private names; messaging behavior unchanged. - Both TUI cold-resume sites now sanitize the model-fed history while leaving the display transcript untouched, so the user still sees their full history. Verified E2E against a real SessionDB: dangling and interrupted tails are stripped from the model feed, healthy mid-progress tool sequences are preserved, and the display transcript is always the full raw history.	2026-06-27 20:56:49 -07:00
Teknium	f03823014b	fix(telegram): kill 409 polling conflict loop by disarming PTB retry synchronously (#53941) Telegram polling entered a self-inflicted ~31s loop of 409 Conflict -> retry -> resume -> Conflict. The error_callback PTB invokes synchronously inside its internal network_retry_loop only scheduled our async recovery task (loop.create_task) and returned, so PTB kept polling getUpdates on its own while our handler concurrently ran stop -> sleep -> start_polling. The two polling sessions overlapped and Telegram returned a fresh 409. Fix: in the conflict branch of the error_callback, synchronously set PTB's private polling stop_event before scheduling recovery. PTB's loop exits on its next tick (it races that event in do_action), so our handler owns polling alone. The handler's await updater.stop() drains the task and PTB clears the event, so the subsequent start_polling() builds a fresh event and is not poisoned. Keeps the existing reconnect ladder intact (option B); fixes only the race. Defensive: probes mangled + unmangled stop_event spellings and no-ops (prior behaviour) if neither exists; never flips _running, which would make the handler skip stop() and leave the loop wedged.	2026-06-27 20:46:08 -07:00
Teknium	d43e0cf304	fix(agent): config-driven intent-ack continuation for all api_modes (#27881) (#53943) * fix(agent): config-driven intent-ack continuation for all api_modes (#27881) The agent could end a turn after only stating intent ('I will run a health check...') without executing the announced tool call, forcing the user to re-prompt. A continuation guard that catches this and nudges the model to proceed already existed but was hard-gated to the codex_responses api_mode, so Gemini/Claude/OpenRouter turns never benefited. - New agent.intent_ack_continuation config (default 'auto' = codex-only, byte-stable for existing conversations). 'true'/model-list opts every api_mode in; 'false' disables. Mirrors agent.tool_use_enforcement's shape. - looks_like_codex_intermediate_ack gains require_workspace (default True). The opted-in path drops the codebase/filesystem requirement so general autonomous workflows (server ops, deploys, API calls) are caught, not just coding tasks. Future-ack + action-verb + short-content + no-prior-tool guards still apply; the 2-nudge-per-turn cap is unchanged. - Resolution centralized in intent_ack_continuation_mode (off/codex_only/all). * docs(infographic): intent-ack continuation (#27881)	2026-06-27 20:46:00 -07:00
Teknium	56abbaeac3	fix(curator): fail closed on unverified skill deletes during consolidation (#53935) The curator's LLM consolidation pass could archive whole clusters of active skills with zero verified consolidations (#29912): a bare prune (skill_manage delete with absorbed_into empty/omitted) from the forked review agent was accepted, removing the skill's name from lookup even though counts.consolidated_this_run was 0. - _delete_skill now fails closed during the curator/background-review pass: a delete is only allowed when it declares a verified consolidation (absorbed_into=<umbrella>, umbrella must exist). A prune with no forwarding target is refused; the skill stays active. The deterministic inactivity prune (archive_skill) is unaffected. - A verified consolidation delete during the curator pass now routes through the recoverable archive primitive instead of shutil.rmtree, so a misjudged consolidation can be undone with hermes curator restore. The usage record is kept (state=archived) rather than forgotten. - Foreground, user-directed deletes keep their existing hard-delete semantics.	2026-06-27 20:45:57 -07:00
konsisumer	11b0be8d15	fix(gateway): avoid Matrix pending invite boot loops	2026-06-27 20:45:51 -07:00
teknium1	a1ac6baac4	fix(gateway): make bg-process reset TTL configurable + surface session-scoped processes Follow-up to the cherry-picked #29212 (#29177): - Promote the 24h stale-process threshold to config.yaml (session_reset.bg_process_max_age_hours) instead of a hardcoded constant. 0 disables the cutoff (legacy: any live process blocks reset). Wired through GatewayConfig.default_reset_policy in gateway/run.py. - Bug 2: process(action=list) now resolves the gateway session_key from the contextvar and surfaces session-scoped background processes (a forgotten preview server under a different task), flagged session_scoped — so the agent/user can discover and kill the blocker. Previously the task-scoped list returned [] and the blocker was invisible. - Tests: config round-trip for the new field, cross-task list visibility. - Docs: messaging session-reset section.	2026-06-27 20:45:43 -07:00
annguyenNous	33d8b66d5b	fix: stale background processes no longer permanently block session reset Background processes (e.g. http.server preview) that Hermes starts and forgets about previously blocked session idle/daily reset indefinitely. The reset guard in session.py checked has_active_for_session() with no max age — a 3-day-old preview server blocked reset the same as a task started 30 seconds ago. Changes: - Add max_active_age parameter to has_active_for_session() in process_registry.py. Processes older than this threshold are ignored. - Add MAX_ACTIVE_PROCESS_AGE constant (24h / 86400s). - Wire max_active_age into the gateway's session store callback in run.py so stale processes no longer block session lifecycle. - Add debug logging when reset is skipped due to active processes. - Add 3 tests covering recent, stale, and legacy (None) max age. Fixes #29177	2026-06-27 20:45:43 -07:00
teknium1	8c8967a50b	fix: defer hermes_subprocess_env import in browser_tool The module-level import broke tests/tools/test_managed_browserbase_and_modal.py, which loads browser_tool.py via spec_from_file_location against a stubbed 'tools' package that does not include tools.environments.local. Move the import into a _build_browser_env() helper called at the two agent-browser spawn sites, matching the lazy-import pattern already used by lazy_deps.py.	2026-06-27 20:45:31 -07:00
teknium1	9c6229ce24	fix(security): centralize credential-safe subprocess env (#29157) Subprocesses spawned outside the terminal/execute_code path (agent-browser, copilot ACP, dep-ensure, lazy_deps uv install, TUI Node host, cli.exec) inherited the operator's full credential environment via os.environ.copy(). The terminal path was already scrubbed by _HERMES_PROVIDER_ENV_BLOCKLIST (#1002/#1264/#32314); these spawn sites bypassed it. Adds hermes_subprocess_env(inherit_credentials=) in tools/environments/local.py reusing the existing dynamic blocklist as the single source of truth: - Tier 1 (_ALWAYS_STRIP_KEYS): gateway bot tokens, GitHub auth, infra secrets -- stripped even for credential-inheriting children. - Tier 2 (_HERMES_PROVIDER_ENV_BLOCKLIST): provider/tool keys -- stripped unless inherit_credentials=True. The opt-in is grep-able for audit. Browser worker keeps a _BROWSER_PASSTHROUGH_KEYS allowlist (BROWSERBASE/FIRECRAWL) re-added after the strip. Model-driving children (ACP, TUI Node host, cli.exec) use inherit_credentials=True so they still get provider keys while losing Tier-1 secrets. Installers (dep-ensure, lazy_deps) inherit nothing sensitive. cua_backend already routed through _sanitize_subprocess_env on main -- left as-is. Gateway adapter utility spawns (gh pr comment, ffmpeg) are left inheriting env: gh needs GH_TOKEN by design, ffmpeg is a trusted system binary -- no untrusted-dependency exposure. This is defense-in-depth (personal-assistant trust model: same-user spawns), making the existing scrub policy uniform across the spawn surface; the main real payoff is shrinking the blast radius if a transitive npm dep in agent-browser is compromised. Reconstructed on current main from the design in #31959 (Tranquil-Flow); also credits #39003 (rodboev), #37843 (coygeek), #35769 (egilewski). Co-authored-by: Tranquil-Flow <tranquil_flow@protonmail.com> Co-authored-by: rodboev <rod.boev@gmail.com> Co-authored-by: egilewski <egilewski@egilewski.com>	2026-06-27 20:45:31 -07:00
Hermes Agent	88b3d8638e	test: de-flake SIGKILL-tree, compression-tip resume, and fallback-cooldown tests Three CI flakes hit while landing the credential-pool restore fix; all three were timing/wall-clock races in the tests, not product bugs (each passes locally and the assertions are correct): - test_entire_tree_is_sigkilled_not_just_parent: _terminate_host_pid SIGKILLs synchronously, but the test's 4s budget after a 1s in-function SIGTERM grace left almost no slack for the kernel to tear down 3 processes + reparent the children to zombies under loaded-CI scheduling. Widen the wait to 15s and make the liveness predicate tolerant of vanished-pid / zombie races. The assertion never weakens: every tree member must end up dead or zombie. - test_session_resume_follows_compression_tip: appended messages got time.time() timestamps (~now) while the test forced session started_at into the past, so the get_compression_tip MAX(m.timestamp) tiebreaker depended on wall-clock ordering. Pass explicit, well-separated message timestamps so the chain resolution is deterministic by construction. - test_non_retryable_exhaustion_arms_cooldown: asserted the short (5s) exhaustion cooldown with a tight +1.0s slack, which false-fails when wall-clock jitter between the 'before' snapshot and the cooldown computation exceeds a second on a loaded runner. Widen to +30s — still cleanly below the 60s rate-limit window it must distinguish from.	2026-06-27 20:04:45 -07:00
Jack Maloney	f0de4c6a47	fix(pool): re-select from credential pool on primary runtime restore _restore_primary_runtime restored the construction-time api_key snapshot and never consulted the credential pool. After the pool rotated away from a revoked/exhausted entry mid-session, every new turn restored the dead key, re-failed instantly, burned the remaining entries, and fell through to cross-provider fallback. After restoring the snapshot, re-select the pool's current best entry and swap the live credential in via _swap_credential (which already rebuilds the OpenAI/Anthropic client, reapplies base-url headers, and carries the #33163 base_url / OAuth-detection fixes). Falls back to the snapshot key when the pool is absent, empty, or the entry has no usable key. Salvaged from #25206 onto current main: the original targeted the pre-refactor monolithic method in run_agent.py; the logic now lives in agent/agent_runtime_helpers.py and is collapsed onto _swap_credential instead of re-inlining the client rebuild. Fixes #25205	2026-06-27 20:04:45 -07:00
teknium1	a590c5efdc	docs: add infographic for provider-precedence fix (#29285)	2026-06-27 19:49:02 -07:00
kshitijk4poor	2af1678bfc	fix(auth): explicit provider intent beats stale OAuth active_provider (#29285) `resolve_provider("auto")` checked `auth.json` `active_provider` BEFORE the config.yaml `model.provider` and env-var API-key checks. So a user who was OAuth-logged-into one provider (e.g. Anthropic) but had set an explicit `model.provider` or exported an API key (e.g. `OPENAI_API_KEY`) was silently routed to the stale OAuth provider; the override was invisible and surprising. Reorder the auto-path so explicit intent wins (the order the issue asks for): 1. explicit CLI api_key/base_url 2. config.yaml `model.provider` (safety net; see below) 3. OPENAI_API_KEY / OPENROUTER_API_KEY env 4. OpenRouter credential pool 5. provider-specific API-key env vars 6. auth.json `active_provider` (OAuth) ← demoted to last-resort 7. AWS Bedrock credential chain 8. error. `active_provider` is still honored; it's just a last-resort fallback chosen only when the user expressed no other preference, instead of overriding one. The normal chat/gateway/TUI/ACP/status path already resolves config.provider upstream in `resolve_requested_provider()` before "auto" is reached, so this duplicate config check is the safety net for the lone direct caller (`main.py` `resolve_provider("auto")`) and any future bypass. Because every surface funnels through this one resolver, the fix propagates everywhere with a single edit; no sibling path re-implements precedence. Also add a one-shot WARN when resolution lands on `active_provider` while a populated `model` config dict lacks a `provider` key, surfacing the silent override the issue reported without breaking first-install. Synthesizes the two competing PRs: #29615 (LifeJiggy: config-before-auth + the silent-override framing) and #29809 (Minksgo: the env-before-auth reorder). #29809 could not be merged directly (bundled unrelated, un-opt-in cost-tagging telemetry); its reorder idea is incorporated here and credited. Tests: tests/hermes_cli/test_provider_precedence.py: config/env beat stale OAuth, OAuth still used as last resort, explicit request short-circuits, WARN fires on silent fall-through. Full provider-resolution suites: 374 passed. Fixes #29285 Co-authored-by: LifeJiggy <141562589+LifeJiggy@users.noreply.github.com> Co-authored-by: Minksgo <153416856+Minksgo@users.noreply.github.com>	2026-06-27 19:49:02 -07:00
teknium1	2b73dd1ca6	fix(gateway): namespace --replace takeover marker by HERMES_HOME to stop cross-profile flap (#29092) Two profile gateway services sharing the default ~/.hermes resolve the takeover marker to the same path. A --replace from profile B could land in profile A's marker, match on PID + start_time by coincidence of a shared PID namespace, and make profile A exit 0, only to be revived by systemd Restart=always, which races the replacer again, flapping indefinitely. write_takeover_marker now stamps replacer_hermes_home; the shared consume path rejects markers written under a different HERMES_HOME and leaves them in place for the correct profile. Absent field (older markers) is treated as same-home, so single-profile and mixed old/new deployments are unaffected. Salvaged from #31414 by @CryptoByz onto current main (branch was ~3962 commits behind; the consume function had since been refactored for issue #34597). Co-authored-by: CryptoByz.	2026-06-27 19:43:02 -07:00
Teknium	28ed883959	docs: add PR infographic for config-defaults fix	2026-06-27 19:38:11 -07:00
Teknium	45b2e4dd6b	fix(config): opt newer migrations out of default-stripping The salvaged #27354 fix made save_config strip schema-default leaves by default. Five migration sites added to main after the PR was authored still called bare save_config(config) and intentionally materialize a (often default-valued) key: model_catalog.ttl_hours, write_approval, curator.consolidate, agent.verify_on_stop, and the suspicious-MCP-server disable. Pass strip_defaults=False so those one-time deliberate writes survive, matching the opt-out the PR applied to the other migrations.	2026-06-27 19:38:11 -07:00
郝鹏宇	98488c4be4	fix(config): prevent save_config from materialising schema defaults Fixes #27354 Root cause: called during init (or by any code path that saves ) wrote injected schema defaults into config.yaml as if the user had authored them. Two fix layers: 1. now only injects when the user actually set somewhere (root or agent). A user who never set keeps it absent, so 's explicit-path detection won't treat it as user-authored. 2. gains a parameter and a new pass that removes keys matching unless those paths were explicitly present in the raw (pre-normalization) config on disk. Explicit-path detection uses on before any normalisation runs — preventing injected-in defaults from being mistaken for user-set values. All migration and edit-config call sites pass to preserve their intentional default-seeding behaviour. New helpers: - — collects leaf-key paths from a raw dict - — removes keys matching schema defaults Test coverage: 4 new regression tests (59 total, all passing).	2026-06-27 19:38:11 -07:00
teknium1	6dcc579bcb	test(streaming): repoint anthropic stream-cleanup test to close+rebuild path The existing test_anthropic_stream_parser_valueerror_retries_before_delivery asserted mock_replace.call_count == 1 — i.e. it passed precisely because the buggy OpenAI rebuild was invoked on the Anthropic path. Repoint it to assert the corrected close+rebuild-Anthropic behavior (#28161).	2026-06-27 19:37:33 -07:00
EloquentBrush0x	a0b9663c7c	fix(streaming): rebuild Anthropic client on stream cleanup instead of OpenAI client interruptible_streaming_api_call() has three connection-pool cleanup sites that called _replace_primary_openai_client() unconditionally. For api_mode=anthropic_messages this has two consequences: 1. _replace_primary_openai_client() fails (OPENAI_API_KEY unset on Anthropic-only configs), so dead connections are never purged. 2. The stale-stream detector's outer-poll site (L1977) is the only mechanism that can interrupt the worker thread while it blocks in for event in stream:. Because the Anthropic client is never closed, the thread stays blocked until the 900 s httpx read-timeout fires, producing a visible 15-minute hang for Telegram/gateway users on claude-opus-4-7. Fix: mirror the existing interrupt-path pattern (L1989-1997) at all three cleanup sites — if api_mode == "anthropic_messages", call _anthropic_client.close() + _rebuild_anthropic_client() instead of _replace_primary_openai_client(). _rebuild_anthropic_client() handles both direct Anthropic and Bedrock-hosted Claude correctly, unlike the inline build_anthropic_client() calls in open PR #14430. PR #14430 (open) covers only the outer stale-detector site (L1977). PR #23678 (open) covers only the inner retry sites (L1774, L1833). This PR covers all three sites and uses _rebuild_anthropic_client() for Bedrock parity. Fixes #28161	2026-06-27 19:37:33 -07:00
xxxigm	6f1a176b33	fix(gateway/discord): REST liveness probe to detect zombie clients (#26656) The Discord adapter could enter a silent zombie state after a network outage / proxy stall: the process is alive, _client looks open, but the underlying socket is dead. discord.py's WebSocket reconnect never sees a RST through a wedged proxy/NAT, so client.start() spins forever without exiting, which means the bot-task done callback (which only fires on task completion) never trips either. The bot stays "offline" in Discord until a manual `hermes gateway restart`. Reported offline for 13-17h. Adds an out-of-band REST liveness probe in DiscordAdapter. Every `discord.liveness_interval_seconds` (default 60s) the adapter issues a cheap fetch_user(bot_id) (the same REST path as message delivery, so it fails when the proxy/NAT is wedged). After `discord.liveness_failure_threshold` consecutive failures (default 3) the probe closes the wedged client and surfaces a retryable fatal error, which trips the gateway's existing _platform_reconnect_watcher and rebuilds the adapter. Operators disable it by setting either knob to 0. Config lives in config.yaml (discord.liveness_) per the .env-is-secrets policy; _apply_yaml_config bridges it to internal env vars the adapter reads, matching the existing HERMES_DISCORD_TEXT_BATCH_ pattern. Co-authored-by: Hermes Agent <agent@nousresearch.com>	2026-06-27 19:30:32 -07:00
teknium1	457c8a0a7c	fix(file-ops): keep worktree isolation when restoring preserved cwd (#26211) The durable _last_known_cwd anchor is keyed by the shared 'default' container, so a non-owning worktree session could inherit the owning session's cwd through it, breaking the wrong-worktree-routing fix (test_file_tools_cwd_resolution::test_resolution_routes_to_resolving_sessions_worktree). Reorder _authoritative_workspace_root so the session-specific registered cwd override (keyed by raw session id) is checked BEFORE the shared-container _last_known_cwd fallback. A non-owning session now resolves into its own registered worktree; the durable anchor only fills in when there's no session-specific override (the #26211 single-session case). Adds a regression test covering the owner-mirrors-then-other-session-resolves interaction.	2026-06-27 19:29:06 -07:00
teknium1	b2faeba182	fix(file-ops): make preserved cwd reachable at write-time resolution (#26211) Belt-and-suspenders on top of the cherry-picked cwd-preservation fix: - Proactively mirror every live terminal cwd into _last_known_cwd on each successful read, so the durable anchor survives even when the cleanup thread pops both _file_ops_cache and _active_environments before _get_file_ops' stale-cache save branch can fire. - Fall back to _last_known_cwd in _authoritative_workspace_root. write_file_tool resolves the path (via _resolve_path_for_task) BEFORE _get_file_ops rebuilds the env, so restoring only the rebuilt env's cwd was insufficient: the resolution that decides where the file lands runs first. This closes that gap. The local env's persisted _cwd_file can't serve this role: it's keyed by a random per-session uuid and deleted on cleanup (the same cleanup that triggers the bug). The in-memory _last_known_cwd registry is the durable anchor instead. Adds a real-IO E2E regression (TestSilentFileMisplacementE2E) exercising the actual write_file_tool path after env cleanup.	2026-06-27 19:29:06 -07:00
zccyman	adeba1d7a8	fix(file-ops): preserve CWD across terminal environment re-creation (#26211) Root cause: when the terminal environment (`_active_environments` entry) is cleaned up and re-created during a long conversation, the new environment always starts with the default config CWD (typically `~/.hermes/hermes-agent`) instead of preserving the user's last-known working directory. Subsequent relative-path writes (`write_file`, `execute_code`, shell commands) silently land in the default CWD, making files appear to be "created but absent." Fix: add `_last_known_cwd` dict that preserves the old environment's CWD before the stale cache entry is invalidated. When a new environment is created for the same task_id, we check `_last_known_cwd` first and use the preserved CWD instead of the config default. Changes: - tools/file_tools.py: add `_last_known_cwd` dict, save CWD before stale cache invalidation, restore CWD on env recreation - tests/tools/test_file_tools.py: add `TestLastKnownCwd` with 2 tests verifying CWD preservation and fallback behavior Fixes #26211	2026-06-27 19:29:06 -07:00
teknium1	926a1b915d	fix(tools): suppress transient check_fn flakes so subagents keep file/terminal tools A flaky external probe in a tool's check_fn (e.g. check_terminal_requirements running `docker version` with a 5s timeout, momentarily timing out under load) would return False for a single get_tool_definitions() call. Because file tools delegate their check_fn to the terminal check, that one flake silently stripped read_file/write_file/patch/search_files AND terminal from whatever agent was being constructed at that instant — most visibly a delegate_task subagent, which then reported "Tool read_file does not exist". This explains both the intermittent (~80% success) user-session failures and the deterministic cron failures in #21658 / #5304. The existing _check_fn TTL cache made this worse: it cached the transient False for the full 30s window, poisoning every subagent spawned in that span. Fix: remember the last time each check_fn returned True; when a fresh probe fails within a short grace window of that success, treat it as a flake — serve the last-good True and do NOT cache the failure (so the next call re-probes). A failure with no recent success, or past the grace window, is honored normally so a backend that genuinely went down stops advertising its tools. Probe failures now log at WARNING regardless of quiet mode, making the previously-silent tool loss diagnosable in subagent (quiet) sessions. Co-authored-by: Stuart Horner <5261694+djstunami@users.noreply.github.com>	2026-06-27 19:29:00 -07:00
Shashwat Gokhe	505bc27d8d	fix(gateway): classify mixed attachments per-attachment + transcode uncommon image formats A document attached alongside an image in the same Discord message was swept into the vision pipeline and 400'd the whole turn ("Could not process image"), and was simultaneously never surfaced to the agent as a readable file. Restores the "any file type works" contract for mixed messages and fixes the HTTP 400. Bug 1 — mixed attachments: the inbound routing loop keyed image/audio/video classification off the message-level type (PHOTO/VOICE/AUDIO), so a doc in a PHOTO message landed in image_paths and poisoned the vision call. The document context-note path was gated on message_type == DOCUMENT, so that same doc never reached the agent at all. Now classification is per-attachment (trust each attachment's own MIME; fall back to the message-level type only when MIME is unknown), via shared _event_media_is_* helpers used by both _build_media_placeholder and the main inbound loop. The document note now fires for any non-image/audio/video attachment regardless of message-level type. Bug 2 — uncommon formats: AVIF/HEIC/BMP/TIFF/ICO produced the same generic 400 because providers only accept PNG/JPEG/GIF/WEBP. image_routing now transcodes those to PNG via Pillow before declaring media_type, skipping cleanly (logged) if Pillow/plugins are missing. SVG is vector — Pillow can't rasterize it — so it's skipped rather than transcoded. Closes #25935. Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com> Co-authored-by: cypres0099 <74935762+cypres0099@users.noreply.github.com>	2026-06-27 19:26:04 -07:00
teknium1	0c372274cd	fix(agent): disable OpenAI SDK auto-retry that double-fires inside the rate-limit loop Same bug class as the Anthropic fix (#26293): the OpenAI/aggregator client is built without max_retries, so the SDK default of 2 applies. The SDK's own 1-2s backoff ignores Retry-After and retries inside hermes's outer conversation loop, burning request slots against a rate-limited bucket. Set max_retries=0 at the single create_openai_client chokepoint (covers init, switch_model, recovery, restore, request-scoped). auxiliary_client builds its own clients and is not wrapped by the loop, so it keeps SDK retries.	2026-06-27 19:23:15 -07:00
konsisumer	1ab35ba25d	fix(anthropic): stop SDK auto-retry double-firing and raise Retry-After cap to 600s The Anthropic SDK clients were built without max_retries, so the SDK default (max_retries=2) retried 429/5xx with its own backoff that ignores Retry-After — double-retrying inside hermes's outer loop and burning request slots against a bucket that won't refill for minutes. Set max_retries=0 on all Anthropic/AnthropicBedrock client constructions so the outer conversation loop (which already honors Retry-After) owns retry. Also raise the Retry-After cap in the conversation loop from 120s to 600s. Anthropic Tier 1 input-token buckets reset in ~171s, so the 120s cap made hermes retry before the reset window and re-trip the limit. Refs #26293	2026-06-27 19:23:15 -07:00
LeonSGP43	32732a8f83	fix(agent): cap same-entry credential refreshes so fallback can activate (#26080 ) A persistent upstream 401 on a single-entry OAuth pool (common for Claude Max subscribers) made the credential-pool recovery spin forever: try_refresh_current() re-mints a fresh token and reports success on every 401, so recover_with_credential_pool returned True and the retry loop continue'd without ever incrementing retry_count or reaching the auth-failover block. The configured fallback_model never activated and the agent appeared to hang. Cap consecutive successful same-entry refreshes (keyed by provider + pool-entry id) at 2; once exceeded, treat the credential as unrecoverable and return not-recovered so the loop falls through to _try_activate_fallback. The 429/billing paths already rotate-or-fall-through correctly (mark_exhausted_and_rotate returns None on a single entry), so only the auth-refresh branch needed the cap. Co-authored-by: Hermes Agent <hermes@nousresearch.com>	2026-06-27 19:20:07 -07:00
Teknium	fae920642a	fix(agent): throttle cross-turn fallback-switch replay storm (#24996 ) (#53909 ) When every provider in the fallback chain fails non-retryably back-to-back (e.g. HTTP 400/402/429 across distinct providers), the within-turn walk is already bounded — _fallback_index advances monotonically and the loop aborts when the chain exhausts. The damaging mode is cross-turn: restore_primary_runtime resets _fallback_index=0 every turn, so a client that re-submits immediately replays the entire chain, re-marshaling the full (potentially 80k-token) context once per provider every turn with no throttle on the non-rate-limit path. On constrained hosts this exhausts memory/swap. Rate-limit/billing failures already arm a 60s cooldown via _rate_limited_until; the gap was the non-rate-limit case. Now, when the chain exhausts on a non-rate-limit failure with a non-empty chain, arm a short (5s) cooldown on the same _rate_limited_until gate (max(), never shrinking an existing window). The next turn's restore stays gated and does NOT reset the index, so the chain isn't replayed until the cooldown clears. No new state, no thread sleep, no false-trip on legitimately long chains (those walk normally within a turn). Tests: tests/run_agent/test_24996_fallback_exhaustion_cooldown.py	2026-06-27 19:15:40 -07:00
Chaz Dinkle	1dde7e2f2a	fix(anthropic): adopt Claude Code's already-refreshed token before racing refresh Claude Code OAuth refresh tokens are single-use; Claude Code refreshes on its own schedule, so by the time Hermes notices an expired token Claude Code may have already rotated it. Re-read live credential sources first and adopt a valid token rather than POSTing a possibly-stale refresh token. Ports the _refresh_oauth_token hardening from PR #40107 (chazmaniandinkle) on top of the keychain/file reconciliation from PR #21112 (nodejun). Adds AUTHOR_MAP entry for nodejun.	2026-06-27 19:14:43 -07:00
jun	5a5396aecb	fix(anthropic): reconcile keychain/file credentials when one is expired read_claude_code_credentials() previously returned the macOS Keychain entry as soon as one existed, even if its OAuth token was already expired. Callers then ran is_claude_code_token_valid() on the result and got False, so resolve_anthropic_token() returned None — surfacing the misleading 'No Anthropic credentials found' error even when ~/.claude/.credentials.json held a perfectly valid token. Now reads both sources and prefers the non-expired one. When both are valid (or both expired), prefers the later expiresAt so any subsequent refresh uses the freshest refresh_token. Adds TestReadClaudeCodeCredentialsDesync covering the four reconciliation cases. The existing 'keychain wins' priority test still passes because both fixtures share the same expiresAt and the tiebreaker is >=.	2026-06-27 19:14:43 -07:00
Teknium	db16854f34	fix(telegram): surface failed media downloads to user and agent, not a silent empty turn (#53912 ) When a Telegram attachment download/cache fails (typically a transient httpx.ConnectError to Telegram's CDN), the except handler logged a warning and fell through to handle_message() with empty media and no text — the user thought the file was delivered, the agent saw a content-less turn with no signal an attachment was attempted, and the only record was a buried log line. Adds _surface_media_cache_failure(): replies to the user in Telegram so they know to retry, and appends an agent-visible notice to event.text via the existing _append_observed_note channel so the agent knows an attachment was attempted and failed. No new event fields (structured-event refactor is out of scope per #23045). Wired into all five cache-failure sites — photo, voice, audio, video, document — since they shared the identical silent fall-through. Bug 1 from #23045 (unsupported types routed as fake user messages) no longer exists on main: the document handler now accepts any file type, so there is no rejection branch to fix. Closes #23045	2026-06-27 19:12:57 -07:00
teknium1	4133cd9fbf	docs(infographic): eager fallback on persistent transport failures	2026-06-27 19:12:21 -07:00
teknium1	6514be5a28	chore(release): add AUTHOR_MAP entry for linyubin (#50228 salvage)	2026-06-27 19:12:21 -07:00
linyubin	c946e6709f	fix(agent): activate fallback on persistent transport failures (#22277 ) Eager fallback previously fired only on rate_limit/billing. A stale-detector-killed hung stream classifies as FailoverReason.timeout (retryable=True) and the retry loop re-hit the same dead primary until the budget exhausted -- 3 x ~180-300s stale kills compounding into a 15+ min silent hang while the configured fallback chain sat idle. Extend the existing eager-fallback gate to also cover timeout and overloaded, but only after one real retry (retry_count >= 2) so genuine transient hiccups still recover on the primary. Reuses the same pool-recovery guard and state-reset as the rate_limit branch -- no new config flag, no change to the rate-limit intent. Salvaged from PR #50228 by @linyubin. Closes #22277. Co-authored-by: Hermes Agent <127238744+teknium1@users.noreply.github.com>	2026-06-27 19:12:21 -07:00
bykim0119	851f75d4df	fix(discord): honor "*" wildcard in DISCORD_ALLOWED_USERS (#22334 ) DISCORD_ALLOWED_USERS="*" now means "allow everyone", matching the SIGNAL_ALLOWED_USERS / DISCORD_ALLOWED_CHANNELS wildcard convention and the value `claw migrate` emits. Previously _is_allowed_user did exact ID matching only, so "*" matched no user and blocked every non-self sender — a P1 with no workaround. Three sites, all required for the fix to hold at runtime: - _is_allowed_user: short-circuit when "*" is in the allowlist. - connect(): exclude "*" from the intents.members trigger so the wildcard does not request the privileged Server Members intent (which can block the bot from coming online). - _resolve_allowed_usernames: preserve "*" verbatim; otherwise it lands in the username-resolution bucket, matches no member, and is silently dropped from the set and env var on the first on_ready — quietly undoing the fix. Slash auth delegates to _is_allowed_user (auto-covered); component auth already honors "*" on main.	2026-06-27 19:11:30 -07:00
Teknium	1207d81eed	fix(gateway): unify outbound chat redaction onto authoritative redactor (#23810 ) (#53907 ) The gateway banner promises 'chat responses are scrubbed before delivery', but _redact_gateway_user_facing_secrets used a divergent 6-pattern subset that leaked credential shapes the comprehensive agent.redact catches — notably the GitHub fine-grained PAT (github_pat_...) and the Telegram bot-token shape (bot<digits>:<token>), the gateway's own credential type. _redact_gateway_user_facing_secrets now delegates to agent.redact.redact_sensitive_text(force=True) — the same Tirith-grade redactor already applied to logs, tool output, and approval-command prompts — so the outbound LLM-response path (final_response -> _sanitize_gateway_final_response) masks the full credential set. The narrow local pattern set is kept as a fail-soft second pass. force=True honors redaction even when security.redact_secrets is off, matching _redact_approval_command. Test: regression guard parametrizing all 5 issue shapes x every chat surface; asserts secret body never reaches the user and surrounding prose survives. The existing bearer-token test's marker assertion is loosened from the literal '[REDACTED]' to mask-agnostic (the redactor masks as '***'/partial) — it asserts the security invariant, not the implementation's mask string.	2026-06-27 19:09:41 -07:00
LeonSGP43	c56b39c11e	fix(auxiliary): fall back to OPENROUTER_API_KEY when credential pool exhausted _try_openrouter() returned (None, None) whenever an OpenRouter credential pool existed but was exhausted (_select_pool_entry -> (True, None)), making the OPENROUTER_API_KEY env-var fallback unreachable. Auxiliary tasks (compression, vision, web_extract) silently failed even with a valid env key. Now the pool-present branch only returns early when it successfully builds a client; an exhausted pool falls through to the env-var path. The final failure (pool exhausted AND no env var) still marks the provider unhealthy. Fixes #23452. Co-authored-by: ambition0802 <noreply@github.com>	2026-06-27 19:09:27 -07:00
qWaitCrypto	46e18804ad	fix(auxiliary): fall back on 401 auth errors in auto mode (#21165 ) When the primary provider returns 401 and the auth-refresh path is unavailable or fails, both call_llm() and async_call_llm() reached the should_fallback gate without _is_auth_error in the condition, so the auxiliary task (e.g. compression) was dropped silently — losing message history. Add _is_auth_error to should_fallback (NOT is_capacity_error) in both sync and async paths, plus an 'auth error' reason branch. Auth stays a non-capacity error: it falls back in auto mode via the is_auto gate, but on an explicitly-configured provider it still respects the user's choice and raises rather than silently switching providers.	2026-06-27 19:07:04 -07:00
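
The keychain/file reconciliation rule described in commit 5a5396aecb above (prefer the non-expired source; when both are valid or both expired, prefer the later expiresAt, with ties going to the keychain) can be sketched roughly as follows. This is only an illustration of the rule as the commit message states it, not the code in read_claude_code_credentials(): the helper name pick_credential, the dict shape, and the assumption that expiresAt is a millisecond epoch timestamp are all hypothetical.

```python
import time
from typing import Optional


def pick_credential(
    keychain: Optional[dict],
    file_cred: Optional[dict],
    now_ms: Optional[int] = None,
) -> Optional[dict]:
    """Hypothetical helper: choose between two credential sources.

    Assumes each credential is a dict with an integer "expiresAt"
    (milliseconds since the epoch); that shape is an assumption.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    # If only one source exists, there is nothing to reconcile.
    if keychain is None:
        return file_cred
    if file_cred is None:
        return keychain

    k_exp = keychain.get("expiresAt", 0)
    f_exp = file_cred.get("expiresAt", 0)
    k_valid = k_exp > now_ms
    f_valid = f_exp > now_ms

    # Exactly one source is still valid: prefer it.
    if k_valid != f_valid:
        return keychain if k_valid else file_cred

    # Both valid or both expired: prefer the later expiresAt.
    # The >= tiebreaker keeps the keychain entry on equal timestamps.
    return keychain if k_exp >= f_exp else file_cred
```

With now_ms fixed at 1000, an expired keychain entry (expiresAt 500) loses to a valid file entry (expiresAt 2000), while two entries with the same expiresAt resolve to the keychain entry, which matches the "keychain wins" tiebreaker mentioned in the commit message.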

13242 commits