hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-07-01 12:02:05 +00:00

Author	SHA1	Message	Date
Ben Barclay	b963d3238b	feat(gateway): suppress home-channel shutdown broadcast on flagged drains (#54824) Add a generic suppress_notification flag to the drain-request marker. When a drain that ends in process exit (e.g. a NAS auto-update image migration on the always-on Hermes Cloud fleet) is flagged, the gateway skips ONLY the home-channel 'gateway shutting down' broadcast — the operator-flavoured ping that would otherwise fire on every routine auto-update, dozens of times a day. The per-active-session interrupt ping is ALWAYS kept: on a drained shutdown it's empty by construction, and in the force-interrupt (deadline-exceeded) case it carries the user-valuable 'your task was cut off, message me to resume' hint. The gateway stays agnostic about WHY a drain is quiet (generic boolean, not a kind enum); the policy of which drain causes set the flag lives in the caller (NAS). Default-false so legacy/operator drains behave exactly as before. The reader reuses the NS-570 epoch-staleness check so an orphaned marker on the durable volume can never silence a fresh gateway's legitimate broadcast. - drain_control.py: write_drain_request gains suppress_notification; new drain_notification_suppressed() reader (current-epoch + truthy flag). - web_server.py: /api/gateway/drain reads + echoes the flag. - run.py: _notify_active_sessions_of_shutdown skips the home-channel loop only. Tests prove: flag round-trips; home-channel suppressed when set, kept when unset; active-session ping always fires; stale/legacy/corrupt markers never suppress.	2026-06-29 12:18:11 -07:00
Teknium	dbad6d47d3	fix(gateway): also neutralize untrusted Matrix room name in prompt Widen #5961's _format_untrusted_prompt_value coverage to the Matrix room display name (Matrix Room:), a sibling attacker-controllable field the original fix missed. chat_name is user-settable, so an injected room name could render as literal markdown in the system prompt. Adds a regression test.	2026-06-29 04:25:51 -07:00
Xowiek	09666ceb76	fix(gateway): neutralize untrusted session metadata in prompts	2026-06-29 04:25:51 -07:00
teknium1	ea1372d2af	fix(security): wire session-id sanitizer into artifact paths + API boundary Defense-in-depth on top of _safe_session_filename_component (#5958): Sink (makes the bad write impossible regardless of entry point): - run_agent._save_session_log: sanitize session_id before building the session_{sid}.json snapshot path. - agent_runtime_helpers.dump_api_request_debug: sanitize before building the request_dump_{sid}_{ts}.json path. Boundary (clean 400 instead of a silently-hashed filename): - api_server rejects path-traversal-shaped X-Hermes-Session-Id on the session-continuation path and the explicit /api/sessions create path, reusing gateway.session._is_path_unsafe (mirrors the native gateway's entry-boundary guard). Also enforces the session-header length cap on the continuation path. Tests: traversal session_id stays contained at the write site; sanitizer always yields a traversal-free segment; the API header rejects ../, absolute, and Windows-traversal IDs with 400.	2026-06-29 04:25:45 -07:00
Mibayy	0fe9755016	fix(gateway): use last_prompt_tokens for session-reset activity check reset_had_activity gated on entry.total_tokens, which is never written (token counts migrated to agent-direct persistence) so it was always 0. That suppressed session-reset notifications for sessions that genuinely had activity. Switch to last_prompt_tokens, which is updated on every turn.	2026-06-29 04:25:37 -07:00
sgaofen	194bff0687	fix(gateway): confirm final delivery before suppressing send Fixes #14238. During a compression/session split at the response boundary, the interim callback delivered unrelated commentary, setting response_previewed=True. The suppression logic treated that as proof the final reply had been delivered and skipped the normal send — the response was persisted to the child session but never sent to chat. Only suppress the normal final send when the stream consumer confirms final delivery (final_response_sent / final_content_delivered) or the exact final response text was delivered as a preview.	2026-06-29 02:37:11 -07:00
teknium1	0b733a8418	test(gateway): pin auto-reset cached-agent eviction (#10710) Relocate marco0158's eviction into the dedicated auto-reset cleanup block (single source of truth for dropping session-scoped transient state) and add an AST invariant pinning _evict_cached_agent into that block. Add AUTHOR_MAP entry for marco0158.	2026-06-28 22:35:17 -07:00
marco0158	b4300f2d96	fix(gateway): evict cached agent on auto-reset to prevent stale context summary leak When a session is auto-reset by daily schedule, idle timeout, or suspended state, the agent cache was not being cleared. This caused the old agent's context_compressor._previous_summary to leak into the new session, mixing old conversation history into new compaction summaries. This was the root cause of the "skin making history" appearing after compaction in fresh sessions reported by the user. Follow-up to #9893 which only handled compression_exhausted case. Changes: - Add _evict_cached_agent(session_key) call after was_auto_reset check - Covers daily, idle, and suspended auto-reset scenarios - Matches the behavior of manual /reset command Related tests: test_session_boundary_hooks, test_async_memory_flush, test_session_reset_notify, test_session_reset_fix - all passing.	2026-06-28 22:35:17 -07:00
Junass1	61a4526ac7	fix(gateway): clear session-scoped model overrides on /resume /resume is a conversation boundary, but unlike /new it did not clear the chat-keyed _session_model_overrides / _pending_model_notes. A /model switch made in the previous session under the same chat session_key leaked into the resumed conversation, running it on the wrong model. Clear both maps for the session_key after the switch (mirroring /new), scoped to that key so other chats' overrides are untouched. The cached-agent eviction this leak also implied already landed via #6672. Closes #10702.	2026-06-28 22:35:12 -07:00
aaronagent	27ddd8fd80	fix(gateway): sanitize agent error messages, validate webhook gh args Two of the three fixes from PR #6660 (the cli.py reopen_session change is moot — that raw _conn.execute reopen block no longer exists on main). - gateway/run.py: stop sending raw type(e).__name__ and str(e)[:300] to end users on chat platforms. Exception text from LLM providers can leak API URLs, file paths, and partial credentials. Return a generic message; keep curated status hints for known HTTP codes; full detail stays in logs. - gateway/platforms/webhook.py: validate pr_number (positive int) and repo (owner/name regex) before passing to the 'gh pr comment' subprocess. Payload-controlled values could otherwise inject gh flags (--help, a different --repo). List-form subprocess means this is arg injection, not shell injection, but validation is still correct. Co-authored-by: aaronagent <1115117931@qq.com>	2026-06-28 18:53:26 -07:00
Teknium	f1cbe4308f	fix(gateway): log error-notification failures instead of silently swallowing (#54472) * fix(gateway): log error-notification failures instead of silently swallowing The last-resort exception handler in _process_message_background() that sends an error notice to the user caught all exceptions with a bare pass, leaving zero trace when the notification itself failed. Upgrade to logger.error(..., exc_info=True) so a failed error-notification send is debuggable post-mortem. Salvaged from #6499 by @BongSuCHOI (the logging-upgrade portion only). * docs: add PR infographic for gateway error-notify logging	2026-06-28 18:52:51 -07:00
Teknium	d65468e7ff	fix(security): SSRF guard yuanbao media download_url (#54470) yuanbao_media.download_url() fetched model-supplied (outbound) and inbound image/file URLs server-side via httpx with follow_redirects=True and no SSRF check. A model response containing <img src="http://169.254.169.254/..."> routed through ImageUrlHandler -> download_url and would fetch cloud-metadata endpoints; same for inbound media. Add an is_safe_url() pre-flight plus an async redirect event-hook that re-validates every 30x target, matching the cache_image_from_url() guard in gateway/platforms/base.py. The other gateway adapters already guard their URL-fetch paths; this was the remaining unguarded one.	2026-06-28 15:29:59 -07:00
Teknium	95f2919f91	perf(startup): lazy-load gateway platform adapters (#54448) Bundled platform plugins (telegram, discord, feishu, teams, ...) were eagerly imported at plugin-discovery time on every `hermes` invocation, including plain `hermes chat` which never touches a gateway platform. Their modules import heavy platform SDKs at module level (lark_oapi, microsoft_teams, discord.py, slack_bolt, ...) — feishu alone pulled in lark_oapi (~2.6s), teams pulled microsoft_teams (~1.9s). Discovery now registers a cheap deferred loader per platform in the platform_registry; the adapter module is imported only when the gateway / cron / setup / send_message path actually asks for that platform. is_registered() and the iterate-all accessors stay correct (deferred counts as registered; plugin_entries()/all_entries() materialize all deferred loaders, since those paths genuinely need every adapter). Cold start: ~4.4s -> ~2.45s to banner. discover_and_load: 2.0s -> 0.3s (warm), and the heavy SDKs are no longer imported at all in CLI mode. Every shipped platform remains available out of the box — it just loads on first use.	2026-06-28 15:11:59 -07:00
Teknium	86e64900b9	fix(gateway): preserve sessions across restarts (#54442)	2026-06-28 15:10:39 -07:00
Teknium	9a0010fd46	fix(windows): cover remaining console-flash spawn legs (#54417)	2026-06-28 13:49:08 -07:00
Teknium	cb982ad997	fix(windows): hide console-window flash on backend git/gh/wmic/bash subprocess spawns The Windows desktop GUI runs its backend headless via pythonw.exe. Several auxiliary subprocess sites that run inside that windowless backend spawned console-subsystem children (git, gh, wmic, powershell, bash, rg, taskkill) WITHOUT CREATE_NO_WINDOW, so Windows allocated a fresh conhost per call and flashed a black window on screen — sometimes continuously (the dashboard Projects-tree git probe alone fired ~118 spawns in 60s on startup). The terminal tool, cron, browser, code_execution, and gateway-spawn paths already carry windows_hide_flags(); these auxiliary probe/scan/launcher legs were missed. Wire the existing helper into them: - tui_gateway/git_probe.py: run_git (+ encoding=utf-8/errors=replace, fixes the cp950 UnicodeDecodeError on CJK paths from the same site) - agent/coding_context.py: _git (per-turn git status/log/diff) - agent/context_references.py: _run_git + _rg_files (@file/@ref resolution) - hermes_cli/copilot_auth.py: gh auth token probe (auxiliary provider:auto) - hermes_cli/gateway.py: wmic + PowerShell Get-CimInstance PID scan - hermes_cli/main.py: wmic stale-dashboard PID scan - gateway/status.py: taskkill /T /F force-kill windows_hide_flags() returns 0 on POSIX, so every changed call is a no-op on Linux/macOS (verified: real git/rg probes still work; Windows-simulated calls all pass creationflags=CREATE_NO_WINDOW). Scoped to the windowless-backend paths that cause the reported flashing. The Electron updater-handoff leg (main.cjs windowsHide:false) and the interactive-CLI banner probes (cli.py) are intentionally NOT touched here — the former needs a Windows-tested change of its own, the latter runs in a visible console anyway. Tracking: #54220 Refs: #53178 #53631 #53781 #53957 #49602 #52982 #53424 #53053 #53016	2026-06-28 05:28:45 -07:00
liuhao1024	9d919daf44	fix(gateway): mark platform lock failure as retryable instead of permanently fatal When a stale lock file survives a gateway crash, `acquire_scoped_lock()` may return `(False, existing_dict)` even after detecting and deleting the stale lock (e.g. if unlink fails or a race condition occurs). Previously, `_acquire_platform_lock()` called `_set_fatal_error(..., retryable=False)`, which permanently killed the platform — the reconnect watcher never retries a non-retryable fatal error. Change to `retryable=True` so the platform enters the "retrying" state and the reconnect watcher can attempt acquisition again after the standard backoff delay. Fixes #54167	2026-06-28 04:35:37 -07:00
tymrtn	d7f655f370	fix: accept typed clarify choice replies	2026-06-28 04:13:19 -07:00
teknium1	9bb5a809b5	fix(gateway): make zombie check defensive against partial psutil stubs The zombie status probe referenced psutil.Process/NoSuchProcess/Error unconditionally, which raised AttributeError when psutil is a partial stub that only defines pid_exists (as in test_windows_native_support's fallback tests). Guard the probe so any failure to read status degrades to the authoritative pid_exists() instead of raising.	2026-06-28 04:11:14 -07:00
MorAlekss	acca526286	fix(gateway): treat zombie PIDs as dead in _pid_exists to unblock --replace (closes #42126) Under systemd Restart=always, the old gateway becomes a zombie (in the process table, awaiting reap) when the replacement starts. _pid_exists() reported the zombie as alive, so --replace waited on a PID that never dies, then aborted with exit 1 — a silent crash loop. Standalone runs are unaffected because nothing respawns the gateway into a zombie. The live path is psutil.pid_exists(), which returns True for zombies, so the check is added there (Process.status() == STATUS_ZOMBIE -> dead). The psutil-less POSIX fallback also reads /proc/<pid>/stat (state Z) with a ps state= fallback for macOS/BSD, before the os.kill(pid, 0) liveness probe. Diagnosis and the /proc + ps POSIX fallback by MorAlekss (PR #44898); extended to cover the psutil hot path so the fix applies on normal installs. Co-authored-by: MorAlekss <mor.aleksandr@yahoo.com>	2026-06-28 04:11:14 -07:00
Teknium	c1c179a239	fix(security): redact secrets in background process + foreground env-dump output (#43025) (#54149) * fix(security): redact secrets in background process + foreground env-dump output Terminal-output redaction was incomplete (#43025): - Gap 1: process(action=poll/log/wait) returned background stdout verbatim — no redaction at all. A background printenv/server/test emitting a key leaked raw to the model, session.db, and CLI display. Same for the gateway background-process watcher's completion/progress notifications. - Gap 2: the foreground terminal path hardcoded code_file=True, which skips the ENV-assignment pass, so an opaque token (no vendor prefix) from env/printenv leaked even there. Adds agent.redact.redact_terminal_output(output, command) as the single policy for ALL terminal-output surfaces: env-dump commands (env/printenv/set/export/declare) get the ENV-assignment pass (code_file=False) to mask opaque tokens; other commands stay on code_file=True to avoid false positives on source dumps. Wired into terminal_tool, process_registry (_handle_process boundary), and the gateway watcher. Respects security.redact_secrets (no force) — opt-out preserved. * docs: add infographic for #43025 terminal-output redaction fix	2026-06-28 02:44:21 -07:00
teknium1	9844243b18	fix(gateway): gate quick_commands through slash access policy Config-backed quick_commands bypassed the admin-only slash gate. The early gate in _handle_message only fires for registry-known commands (is_gateway_known_command), but quick_commands are never in the gateway registry, so they reached the type:exec dispatch sink unchecked. An allowlisted non-admin gateway user could invoke admin-only quick commands — including shell exec in the gateway process — even when the operator set allow_admin_from / user_allowed_commands to lock them out. Apply _check_slash_access(source, command) at the quick_commands dispatch site (the single exec chokepoint, cold-path only) using the raw typed name. Admins and users with the command in user_allowed_commands still run it; backward-compat (no policy set) is unaffected. Fixes #44727. Co-authored-by: maxpetrusenko <max.petrusenko.agent@gmail.com> Co-authored-by: zapabob <1920071390@campus.ouj.ac.jp>	2026-06-28 02:43:23 -07:00
Teknium	00d8c2c915	fix(gateway): prune stale sessions.json entries on startup A hard gateway crash (exit code 1) skips the graceful shutdown path, so sessions.json is never cleared and is left pointing at sessions already ended in state.db. On the next startup get_or_create_session() reuses those stale entries as long as the time/policy reset checks pass — it never consults end_reason — so every incoming message is silently routed into a closed session, with no log or error (#52804). SessionStore._ensure_loaded_locked() now calls a new _prune_stale_sessions_locked() that drops any entry whose session_id has end_reason IS NOT NULL in state.db. Idempotent, _db=None / legacy-absent safe, DB errors non-fatal, sessions.json rewritten only when something was pruned. Self-heals into a fresh session on the next message. Reported and diagnosed by @terry197913 (#52808).	2026-06-28 02:41:47 -07:00
teknium1	ea5aaa7a22	fix(gateway): offload remaining inline agent cleanup off the event loop (#53175) #35994 moved /new reset cleanup off the loop, but _cleanup_agent_resources (agent.close() subprocess teardown; shutdown_memory_provider() plugin IO) was still called INLINE on the event loop from three other sites: - _session_expiry_watcher (5-min idle sweep) — live loop - _handle_message_with_agent cache-hygiene re-eviction — live loop - _finalize_shutdown_agents / stop() idle-cache loop — shutdown A wedged memory provider on any of these froze the loop: bot goes silent, runtime-status updated_at heartbeat stops advancing, and SIGTERM can't be serviced (requires kill -9) — exactly the #53175 zombie pattern. Adds _cleanup_agent_resources_off_loop: a bounded (30s) worker-thread offload mirroring the #35994 reset fix, and routes all four sites through it.	2026-06-28 02:41:36 -07:00
LeonSGP43	9f0e64cedd	fix(gateway): force exit after graceful shutdown Co-Authored-By: Paperclip <noreply@paperclip.ing>	2026-06-28 02:34:23 -07:00
teknium1	c23f394eb8	fix: satisfy ruff encoding + windows-footgun lints for cgroup reaper - read_text(encoding='utf-8') (PLW1514) - # windows-footgun: ok on signal.SIGKILL — module is Linux-only (reads /proc, /sys/fs/cgroup; runs from a systemd unit) - test lambda accepts the new encoding kwarg	2026-06-28 02:05:50 -07:00
PRATHAMESH75	e551da6ddb	fix(gateway): reap cgroup orphans via ExecStopPost to unblock restart Long-lived helpers spawned indirectly by tool calls (adb, platform bridges) were left in the service cgroup after the gateway's main process exited. When the kernel rejected the deferred cgroup-wide kill with EINVAL, systemd blocked Restart=always for 6+ minutes, taking down all platforms and cron windows (#37454). Add a small ExecStopPost helper (gateway.cgroup_cleanup) that walks cgroup.procs and sends per-PID SIGKILLs — a different kernel code path than cgroup.kill, so it succeeds where the cgroup-wide write failed. KillMode=mixed is preserved so the gateway still reaps its own tool-call children before systemd intervenes (#8202). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-28 02:05:50 -07:00
teknium1	58c36b1798	fix(api-server): widen error redaction to cron-endpoint + SSE sites Follow-up to the salvaged #37733 fix. The contributor centralized redaction at _openai_error and the chat/responses failure paths, which covers the OpenAI-compatible envelopes transitively. Two sibling classes crossed the same authenticated HTTP boundary unredacted: - 8x cron-management endpoints returning {"error": str(e)} on 500 - the session-chat SSE error event ({"message": str(exc)}) Route both through the same _redact_api_error_text(force=True) helper. Add AUTHOR_MAP entry for coygeek and a TestRedactApiErrorText guard covering mask/force/limit/passthrough behavior.	2026-06-28 02:05:38 -07:00
Coy Geek	5e774de76e	fix(api-server): redact provider errors at HTTP boundary Force API-server error text through the existing secret redactor before returning OpenAI-compatible errors, response fallback text, response snapshots, and run failure events. This prevents credential-shaped provider failure text from crossing the API-server boundary while preserving debuggable sanitized messages.	2026-06-28 02:05:38 -07:00
HexLab98	d2ea948bc0	fix(gateway): suppress compression status noise on Discord and other chats (#39293) Extend the gateway noisy-status filter beyond Telegram so internal compression lifecycle messages stay in logs instead of spamming Discord, Slack, and other messaging channels.	2026-06-28 14:35:32 +05:30
fesalfayed	263ffec1b0	fix(whatsapp): resolve LID aliases on modern platforms/ session layout expand_whatsapp_aliases hardcoded get_hermes_home()/whatsapp/session, but the adapter writes lid-mapping files via get_hermes_dir("platforms/whatsapp/session", "whatsapp/session"). On installs without the legacy directory the two paths diverge, so the resolver finds no mappings and returns the bare LID, which misses the allowlist and silently drops the message. Resolve through the same helper so both sides stay in lockstep on new and legacy layouts.	2026-06-28 02:05:26 -07:00
teknium1	64972b6403	fix(config): canonicalize model.name/model.model to model.default (#34500) A custom_providers config that names the model under model.name (or model.model) resolved to an empty model, so the API request went out with model= — HTTP 400 from OpenAI-compatible backends. Display paths (hermes status/dump) already read model.name and showed the model, making the failure silent. The model id was read via 'default or model' at ~14 independent sites (cli, gateway, cron, curator, oneshot, fallback, profiles, ...), none of which honored 'name'. Rather than patch every site, canonicalize at the single load/save chokepoint: _normalize_root_model_keys() now promotes model.model/model.name -> model.default (precedence default > model > name) and drops the stale alias, so every reader — present and future — sees a populated default and config.yaml is migrated canonical on next save. The gateway, which bypasses load_config(), replays the same normalization in _load_gateway_config(). Co-authored-by: Bartok9 <danielrpike9@gmail.com> Credit: root-cause analysis and fix direction from @Bartok9 (#34502, first) and @v86861062 (#34527).	2026-06-28 02:05:13 -07:00
Teknium	c9df4bc094	fix(gateway): default restart_drain_timeout to 0 to kill systemd crash loop (#54066) A restart now interrupts in-flight agents immediately rather than holding the gateway open for a grace window. The previous 180s default coupled two independently-set timers: the gateway's own drain timer and systemd's TimeoutStopSec. On a stale unit where TimeoutStopSec < drain, systemd SIGKILLed the gateway mid-cleanup, leaving a stale lock that made the next startup exit immediately ('already running') — an infinite crash loop under Restart=on-failure (#31981). Setting drain to 0 makes the mismatch structurally impossible: with drain 0 the generated unit gets TimeoutStopSec=90 against a near-instant drain, so systemd never kills mid-cleanup. Contract: restart the gateway, in-flight work stops. A grace window large enough to 'save' a long agent turn would have to outlast an unbounded task, which is impossible. Also fixes the stale-unit warning's suggested command (hermes gateway service install --replace -> hermes gateway install --force); the former subcommand does not exist. Closes #31981	2026-06-28 01:14:34 -07:00
Teknium	90d25adc9e	fix(gateway): deliver profile-scoped cache media on symlinked HERMES_HOME (#54060) Generated images under a profile gateway's cache (profiles/<name>/cache/images/...) were silently dropped from Telegram/Discord delivery when HERMES_HOME is symlinked under a denied prefix (e.g. /opt/data -> /root/.hermes) and $HOME is not that prefix. The resolved path lands under /root (a system denylist prefix), the root-home exception only fires when the denied prefix IS $HOME, and the static safe-roots list only covers the active HERMES_HOME's top-level cache — not per-profile cache dirs. Both gates fail, so validate_media_delivery_path returns None and the gateway logs 'Skipping unsafe MEDIA directive path'. _media_delivery_allowed_roots() now also enumerates per-profile cache roots (<root>/profiles/*/cache/{images,audio,videos,documents,screenshots}) at check time. Allowlist match runs before the denylist, so the profile artifact delivers regardless of the /root interaction; profile-dir credentials (auth.json) stay blocked since they aren't under a cache subdir. Reopened regression of #34485/#38108, neither of which covered the profile-scoped symlink case. Fixes #31733.	2026-06-28 01:07:28 -07:00
sweetcornna	fc70d023d8	fix(telegram): apply bot auth policy to Telegram sources # Conflicts: # gateway/config.py	2026-06-28 00:57:03 -07:00
infinitycrew39	e860a40e14	fix(agent,gateway): surface partial-stream recovery and bound detached restart Salvage of NousResearch/hermes-agent#41498 (0-CYBERDYNE-SYSTEMS-0). - Leave response_previewed false on partial_stream_recovery so gateway fallback delivery can send the recovered fragment plus explanation. - Always append the turn-completion explainer for partial_stream_recovery, not only for empty or very short fragments (#34452 gap). - Launch the detached /restart helper before drain, idempotently, with a bounded wait of restart_drain_timeout + 5s.	2026-06-27 22:03:14 -07:00
Priyanshu Sharma	f6deabca0d	fix(gateway): clear stale base_url on model switches	2026-06-27 21:23:25 -07:00
Teknium	a8c862900b	fix(tui): sanitize replay history on WebUI/TUI session resume (#29086) (#53939) A WebUI/TUI session whose last turn died mid-tool-loop (stale-timeout kill, interrupt, or process restart before the tool result was written) persists a dangling assistant(tool_calls) or interrupted assistant->tool tail. The messaging gateway already strips these tails before replay (the #49201 fix), but the TUI/WebUI resume path fed db.get_messages_as_conversation() straight in as the agent's conversation_history with no cleanup. The model re-issued the unanswered call on every resume -- including after a full WebUI + Gateway restart, since the poison lives in the SessionDB, not memory -- leaving the session permanently 'thinking'. Only deleting the session recovered it. - Extract the two strippers + helper from gateway/run.py into a shared agent/replay_cleanup.py (sanitize_replay_history wraps both). - gateway/run.py re-exports under the historical private names; messaging behavior unchanged. - Both TUI cold-resume sites now sanitize the model-fed history while leaving the display transcript untouched, so the user still sees their full history. Verified E2E against a real SessionDB: dangling and interrupted tails are stripped from the model feed, healthy mid-progress tool sequences are preserved, and the display transcript is always the full raw history.	2026-06-27 20:56:49 -07:00
teknium1	a1ac6baac4	fix(gateway): make bg-process reset TTL configurable + surface session-scoped processes Follow-up to the cherry-picked #29212 (#29177): - Promote the 24h stale-process threshold to config.yaml (session_reset.bg_process_max_age_hours) instead of a hardcoded constant. 0 disables the cutoff (legacy: any live process blocks reset). Wired through GatewayConfig.default_reset_policy in gateway/run.py. - Bug 2: process(action=list) now resolves the gateway session_key from the contextvar and surfaces session-scoped background processes (a forgotten preview server under a different task), flagged session_scoped — so the agent/user can discover and kill the blocker. Previously the task-scoped list returned [] and the blocker was invisible. - Tests: config round-trip for the new field, cross-task list visibility. - Docs: messaging session-reset section.	2026-06-27 20:45:43 -07:00
annguyenNous	33d8b66d5b	fix: stale background processes no longer permanently block session reset Background processes (e.g. http.server preview) that Hermes starts and forgets about previously blocked session idle/daily reset indefinitely. The reset guard in session.py checked has_active_for_session() with no max age — a 3-day-old preview server blocked reset the same as a task started 30 seconds ago. Changes: - Add max_active_age parameter to has_active_for_session() in process_registry.py. Processes older than this threshold are ignored. - Add MAX_ACTIVE_PROCESS_AGE constant (24h / 86400s). - Wire max_active_age into the gateway's session store callback in run.py so stale processes no longer block session lifecycle. - Add debug logging when reset is skipped due to active processes. - Add 3 tests covering recent, stale, and legacy (None) max age. Fixes #29177	2026-06-27 20:45:43 -07:00
teknium1	2b73dd1ca6	fix(gateway): namespace --replace takeover marker by HERMES_HOME to stop cross-profile flap (#29092) Two profile gateway services sharing the default ~/.hermes resolve the takeover marker to the same path. A --replace from profile B could land in profile A's marker, match on PID + start_time by coincidence of a shared PID namespace, and make profile A exit 0 — only to be revived by systemd Restart=always, which races the replacer again, flapping indefinitely. write_takeover_marker now stamps replacer_hermes_home; the shared consume path rejects markers written under a different HERMES_HOME and leaves them in place for the correct profile. Absent field (older markers) is treated as same-home, so single-profile and mixed old/new deployments are unaffected. Salvaged from #31414 by @CryptoByz onto current main (branch was ~3962 commits behind; the consume function had since been refactored for issue #34597). Co-authored-by: CryptoByz.	2026-06-27 19:43:02 -07:00
Shashwat Gokhe	505bc27d8d	fix(gateway): classify mixed attachments per-attachment + transcode uncommon image formats A document attached alongside an image in the same Discord message was swept into the vision pipeline and 400'd the whole turn ("Could not process image"), and was simultaneously never surfaced to the agent as a readable file. Restores the "any file type works" contract for mixed messages and fixes the HTTP 400. Bug 1 — mixed attachments: the inbound routing loop keyed image/audio/video classification off the message-level type (PHOTO/VOICE/AUDIO), so a doc in a PHOTO message landed in image_paths and poisoned the vision call. The document context-note path was gated on message_type == DOCUMENT, so that same doc never reached the agent at all. Now classification is per-attachment (trust each attachment's own MIME; fall back to the message-level type only when MIME is unknown), via shared _event_media_is_* helpers used by both _build_media_placeholder and the main inbound loop. The document note now fires for any non-image/audio/video attachment regardless of message-level type. Bug 2 — uncommon formats: AVIF/HEIC/BMP/TIFF/ICO produced the same generic 400 because providers only accept PNG/JPEG/GIF/WEBP. image_routing now transcodes those to PNG via Pillow before declaring media_type, skipping cleanly (logged) if Pillow/plugins are missing. SVG is vector — Pillow can't rasterize it — so it's skipped rather than transcoded. Closes #25935. Co-authored-by: LeonSGP43 <cine.dreamer.one@gmail.com> Co-authored-by: cypres0099 <74935762+cypres0099@users.noreply.github.com>	2026-06-27 19:26:04 -07:00
Teknium	1207d81eed	fix(gateway): unify outbound chat redaction onto authoritative redactor (#23810) (#53907) The gateway banner promises 'chat responses are scrubbed before delivery', but _redact_gateway_user_facing_secrets used a divergent 6-pattern subset that leaked credential shapes the comprehensive agent.redact catches — notably the GitHub fine-grained PAT (github_pat_...) and the Telegram bot-token shape (bot<digits>:<token>), the gateway's own credential type. _redact_gateway_user_facing_secrets now delegates to agent.redact.redact_sensitive_text(force=True) — the same Tirith-grade redactor already applied to logs, tool output, and approval-command prompts — so the outbound LLM-response path (final_response -> _sanitize_gateway_final_response) masks the full credential set. The narrow local pattern set is kept as a fail-soft second pass. force=True honors redaction even when security.redact_secrets is off, matching _redact_approval_command. Test: regression guard parametrizing all 5 issue shapes x every chat surface; asserts secret body never reaches the user and surrounding prose survives. The existing bearer-token test's marker assertion is loosened from the literal '[REDACTED]' to mask-agnostic (the redactor masks as '***'/partial) — it asserts the security invariant, not the implementation's mask string.	2026-06-27 19:09:41 -07:00
teknium1	ccf526964a	fix(gateway): bound adapter teardown awaits on the stop path (#14128 ) The main stop loop in _stop_impl() awaited adapter.cancel_background_tasks() and adapter.disconnect() with no timeout, for both the primary and the secondary-profile (multiplex) adapter maps. A half-dead platform — a wedged Feishu/Lark WebSocket thread blocked on network I/O is the reported case — makes one of those awaits block forever, so the process never exits. systemd then SIGKILLs it after TimeoutStopSec, skipping atexit PID-file cleanup, and the next start dies with 'PID file race lost' and enters a restart loop. The per-adapter timeout infra already existed on main (_adapter_disconnect_timeout_secs / HERMES_GATEWAY_ADAPTER_DISCONNECT_TIMEOUT, default 5s) but was only wired into _safe_adapter_disconnect, which the teardown path never calls. Add _bounded_adapter_teardown(): wraps BOTH cancel_background_tasks() and disconnect() in the existing timeout budget, logs and forces forward progress on timeout, and never raises. Both teardown loops now route through it, so the stop sequence always completes regardless of any adapter's internal behavior and PID-file cleanup runs. Original report + fix direction by @happy5318 (#14128, #14130); this widens it to cover cancel_background_tasks(), the multiplex loop, and the config knob. Co-authored-by: happy5318 <happy5318@users.noreply.github.com>	2026-06-27 19:05:04 -07:00
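The bounded-teardown idea in ccf526964a can be sketched as follows. This is a minimal illustration, not the gateway's code: `bounded_teardown` and `WedgedAdapter` are hypothetical names, and only `cancel_background_tasks()`, `disconnect()`, and the 5s default come from the commit message.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def bounded_teardown(adapter, timeout: float = 5.0) -> bool:
    """Hypothetical helper: await each teardown step under a timeout; never raise.

    Returns True only if every step finished cleanly within the budget.
    """
    clean = True
    for step_name in ("cancel_background_tasks", "disconnect"):
        step = getattr(adapter, step_name)
        try:
            await asyncio.wait_for(step(), timeout=timeout)
        except asyncio.TimeoutError:
            # Log and move on so the stop sequence always makes progress.
            logger.warning("%s timed out after %.2fs; continuing", step_name, timeout)
            clean = False
        except Exception:
            logger.exception("%s failed; continuing", step_name)
            clean = False
    return clean


class WedgedAdapter:
    """Stand-in adapter (assumption) whose disconnect() never returns."""

    async def cancel_background_tasks(self) -> None:
        return None

    async def disconnect(self) -> None:
        await asyncio.Event().wait()  # never set, so this waits forever


if __name__ == "__main__":
    print(asyncio.run(bounded_teardown(WedgedAdapter(), timeout=0.05)))
```

Note that `asyncio.wait_for` only cancels the awaited coroutine. A thread blocked in synchronous I/O keeps running, but the stop path no longer waits on it.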
Teknium	d3d621f7c3	revert(windows): roll back terminal-popup PRs #53791 #53810 #53829 (#53853) * Revert "fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829)" This reverts commit `2ecca1e7d3`. * Revert "fix(windows): stop terminal-window popups from background spawns (#53810)" This reverts commit `5db1430af9`. * Revert "fix(windows): stop subprocess console-window popups + add CI guard (#53791)" This reverts commit `ef17cd204d`.	2026-06-27 15:59:00 -07:00
Teknium	1d32e5d98c	fix(gateway): relay _thinking bubbles when thinking_progress is on but tool_progress is off (#53849 ) display.thinking_progress is documented as independent of tool_progress — users can keep tool progress quiet while opting into mid-turn assistant scratch-text bubbles. But two gates were keyed on tool_progress_enabled alone, so with tool_progress:off the _thinking relay was silently dead even when thinking_progress:true: 1. agent.tool_progress_callback was set to None unless tool_progress_enabled, so the callback that queues _thinking text never fired. 2. The send_progress_messages drain task was only started when tool_progress_enabled, so even queued messages had no consumer. Both now gate on needs_progress_queue (tool_progress OR thinking_progress) — the same condition that already decides whether to create the progress queue at all. No effect when both are off (queue is None) or when tool_progress is on (unchanged). Tests: _thinking relays with thinking_progress:on/tool_progress:off, and is suppressed when thinking_progress:off. Full progress-topics suite: 35 pass.	2026-06-27 15:48:20 -07:00
Teknium	2ecca1e7d3	fix(windows): capture is not a no-window boundary; route flashing spawns through chokepoint (#53829 ) Follow-up to #53791 addressing review feedback: the footgun checker treated capture_output=/stdout=/stderr=/check_output as proof a subprocess can't pop a Windows console. That invariant is false — stream redirection controls where a child's output goes, not whether a console is allocated. From a console-less parent (Desktop/Electron, pythonw.exe, detached gateway/cron) a console-subsystem child still flashes a window even when fully captured. - check-windows-footguns.py: capture/redirect/check_output is no longer a blanket safe-pass. Added _WINDOWS_FLASHING_PROGRAMS (git/gh/npm/node/python/uv/ffmpeg/ docker/powershell/…); calls to those are flagged even when captured. Non-flashing programs keep the capture exemption (no 271-site noise). _subprocess_compat.run/ popen calls are inherently safe (wrapper injects CREATE_NO_WINDOW). - Routed the 35 genuine flashing git/gh/npm/uv/ffmpeg/docker spawns through the _subprocess_compat.run/popen chokepoint (Brooklyn's wrapper from #53810) — the durable fix, not per-site annotations. cmd.exe /c start stays # ok (intentional). - Updated tests + CONTRIBUTING.md rule #17 to the corrected invariant.	2026-06-27 14:49:41 -07:00
brooklyn!	5db1430af9	fix(windows): stop terminal-window popups from background spawns (#53810 ) * fix(windows): stop terminal-window popups from background spawns Native-Windows desktop/gateway users saw cmd/conhost windows flash on gateway restart, image paste, the dashboard Projects tree, voice notes, and ~5 min after closing the app (detached cron). Two root causes: - Console-subsystem exes (taskkill, schtasks, wmic, netstat, tasklist, agent-browser, git, ffmpeg, powershell, git-bash) spawned via raw subprocess allocate a fresh console when the launching process has none (pythonw desktop backend / detached gateway) - even with output captured. - uv venv pythonw shims re-exec console python.exe, so Python children get a console regardless of how they're launched. Fixes: - Single hidden-spawn primitive (_subprocess_compat.run/.popen) that ORs CREATE_NO_WINDOW on Windows, no-op on POSIX. Route every Hermes-owned console-exe spawn through it. - FreeConsole() catch-all in hermes_bootstrap: any Python child that exclusively owns an auto-allocated console detaches it at startup (GetConsoleProcessList()==1 gate leaves shared interactive consoles untouched). - Replace PowerShell/wmic gateway PID scans with in-process psutil. - Skip schtasks queries on non-interactive desktop restarts. - Prefer native agent-browser .exe over .cmd shims. - Guard test bans raw subprocess spawns of the Windows-only console tools repo-wide so the popup class can't regress. * fix(windows): scope FreeConsole to background entry points; fix merge fallout Console detach review (per #53810 feedback): GetConsoleProcessList()==1 can't tell a uv pythonw->python phantom console apart from a user opening the interactive CLI/TUI in its own fresh console (double-click, shortcut, ConPTY) — both report a single attached process with a tty. Running FreeConsole() in the import-time bootstrap therefore risked detaching a legitimately-interactive terminal. 
- Extract FreeConsole into explicit hermes_bootstrap.detach_orphan_console(); remove it from apply_windows_utf8_bootstrap() (import side effect). - Call it only from known background mains: gateway run, dashboard backend (start_server, what the desktop spawns), cron standalone, tui_gateway entry, slash worker. Interactive CLI/TUI never calls it. - Behavior-contract tests: frees only when solo owner, leaves shared console, no-op without console / on POSIX, and asserts it's not an import side effect. Merge fallout from origin/main (#53791): - local.py: 3-way merge left a dangling *_popen_kwargs (NameError crashing every terminal init). _subprocess_compat.popen already hides the window, so drop it. - discord adapter: merge stacked an undefined windows_hide_flags() onto the primitive call; drop the redundant arg. - test_gateway: scan now goes psutil-first (zero spawn); rewrite the case-variant test to drive that production path. test(claw): mock _subprocess_compat.run seam for Windows process scan claw.py's Windows tasklist/powershell scan routes through the hidden-spawn primitive; the tests still patched claw_mod.subprocess, so on win32 the mock was never hit and real spawns returned nothing. Patch the actual seam.	2026-06-27 14:02:24 -07:00
Teknium	d4c2217e87	fix(gateway): offload /model switch off the event loop (#53603 ) The Telegram/Discord /model command's actual switch calls switch_model() directly on the asyncio event loop. switch_model() can fall through to a synchronous models.dev HTTP fetch (requests.get, 15s timeout) on a cold or expired cache, freezing the gateway for up to 15s and dropping the Telegram connection while a user switches models. The picker provider-list and fallback text-list sites were already offloaded (#41289), but the two _switch_model() calls — the picker callback and the direct /model <name> path — were not. Wrap both in asyncio.to_thread. Closes #20525.	2026-06-27 04:36:22 -07:00
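The offload described in d4c2217e87 can be illustrated with a small sketch. The `switch_model` below is a hypothetical stand-in that sleeps instead of doing the real HTTP fetch, and `handle_model_command` is an assumed name. Only the use of `asyncio.to_thread` comes from the commit message.

```python
import asyncio
import time


def switch_model(name: str) -> str:
    """Hypothetical blocking switch; sleeps in place of a synchronous HTTP fetch."""
    time.sleep(0.2)
    return f"switched to {name}"


async def handle_model_command(name: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    return await asyncio.to_thread(switch_model, name)


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the loop gets to run while the switch is in flight.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    hb = asyncio.create_task(heartbeat())
    result = await handle_model_command("example-model")
    hb.cancel()
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

If `switch_model` were called directly inside the coroutine, the heartbeat would not tick at all during the 0.2s sleep. With `asyncio.to_thread` it keeps ticking, which is the property that keeps a long-lived platform connection alive.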
Teknium	caf4dcc7ad	fix(whatsapp): resolve phone↔LID aliases in adapter DM/group allowlist (#53588) The adapter-level intake gate (_is_dm_allowed / _is_group_allowed, reached via _should_process_message) did a raw set-membership check against the configured allowlist. WhatsApp now delivers inbound DM senders in LID form (<id>@lid) while operators configure allowlists with phone numbers, so the check never matched and every DM from an allowed contact was silently dropped before the gateway authz layer ran. Route both gates through the existing gateway.whatsapp_identity.expand_whatsapp_aliases helper (already used by gateway authz and session keys), which walks the bridge's lid-mapping-*.json session files. Phone and LID forms now resolve to each other in both directions; exact JID matches, wildcard, disabled/open policies, and empty-allowlist fail-closed behavior are all preserved. Fixes #14486	2026-06-27 04:17:12 -07:00
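The bidirectional alias check described in caf4dcc7ad can be sketched roughly as below. This is an assumption-heavy illustration: `expand_aliases` and `is_allowed` are hypothetical names, the mapping is a plain in-memory dict rather than the bridge's lid-mapping-*.json files (whose layout is not shown here), and the `"*"` wildcard spelling and the example identifiers are assumed.

```python
def expand_aliases(identity: str, phone_to_lid: dict[str, str]) -> set[str]:
    """Return the identity plus its phone or LID alias, if the mapping knows one."""
    aliases = {identity}
    lid_to_phone = {lid: phone for phone, lid in phone_to_lid.items()}
    if identity in phone_to_lid:
        aliases.add(phone_to_lid[identity])
    if identity in lid_to_phone:
        aliases.add(lid_to_phone[identity])
    return aliases


def is_allowed(sender: str, allowlist: set[str], phone_to_lid: dict[str, str]) -> bool:
    """Hypothetical gate: match the sender against the allowlist via aliases."""
    if not allowlist:
        return False  # an empty allowlist fails closed
    if "*" in allowlist:  # assumed wildcard spelling
        return True
    allowed: set[str] = set()
    for entry in allowlist:
        allowed |= expand_aliases(entry, phone_to_lid)
    # Exact matches still work because each identity is its own alias.
    return bool(expand_aliases(sender, phone_to_lid) & allowed)


if __name__ == "__main__":
    mapping = {"15551234567@s.whatsapp.net": "987654321@lid"}  # illustrative values
    print(is_allowed("987654321@lid", {"15551234567@s.whatsapp.net"}, mapping))
```

A raw `sender in allowlist` test would reject the LID-form sender in this example even though the operator allowlisted the matching phone number. Expanding both sides before intersecting is what closes that gap.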

2278 commits