hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-06-28 11:32:22 +00:00

Author	SHA1	Message	Date
kshitij	5de8a8fbe8	Merge pull request #52375 from NousResearch/salvage/47237-dedupe-user-turns fix(gateway): dedupe user turns on transient failure (#47237)	2026-06-26 00:30:59 +05:30
davidgut1982	6208d6b3be	fix(gateway): dedupe user turns on transient failure (#47237) When the gateway persists a user message after a transient provider failure (429/timeout/auth error), subsequent retries of the same Telegram message could stack duplicate user turns in the transcript, causing the agent to fall behind by 1-2 messages. Add has_platform_message_id() to SessionDB (using the existing idx_messages_platform_msg_id partial index) and a SessionStore wrapper. The gateway's transient-failure path checks this before append_to_transcript -- if the platform_message_id is already persisted, the duplicate write is skipped. Salvaged from #47869 by @davidgut1982. Adapted to current main which has additional append sites and an existing content-based dedupe in the exception handler path. Closes #47237	2026-06-26 00:11:17 +05:30
kshitij	d682f320b3	Merge pull request #52147 from NousResearch/salvage/29184-mcp-osv-nonblocking fix(mcp): run OSV malware preflight off the event loop with a bounded timeout (#29184)	2026-06-25 23:39:44 +05:30
kshitij	c210e23a02	Merge pull request #52386 from NousResearch/salvage/31999-yaml-indent fix(utils): unify YAML list indent across all config writers (#31999)	2026-06-25 23:39:37 +05:30
qdaszx	6305ac0e4b	fix(mcp): run OSV malware preflight off the event loop with a bounded timeout (#29184) During stdio MCP server startup, _run_stdio (an async method) called the synchronous check_package_for_malware() inline. That makes a blocking urllib HTTPS POST to api.osv.dev whose own timeout doesn't reliably cover a stalled SSL handshake, so an intermittent network issue froze the entire asyncio event loop for up to ~120s — blowing past the TUI/gateway's 15s startup budget and showing "gateway startup timeout". Run the check via asyncio.to_thread (off the loop) AND bound it with asyncio.wait_for(timeout=_OSV_MALWARE_CHECK_TIMEOUT_S=12s). The malware check is fail-open, so on timeout we log and proceed rather than blocking startup. Salvaged from #29190 by @qdaszx (re-applied on current main — the call site moved since the PR was opened), combining the to_thread approach also proposed in #29192 by @ygd58. Two load-bearing tests: event-loop-not-blocked-during-check and timeout-fails-open — both mutation-verified to fail against the old inline blocking call. Closes #29184. Co-authored-by: ygd58 <buraysandro9@gmail.com>	2026-06-25 23:30:41 +05:30
xxxigm	0aea0c3654	fix(utils): unify YAML list indent across all config writers (#31999) atomic_yaml_write used default yaml.dump which emits indentless sequences (list items at column 0), while atomic_roundtrip_yaml_update (ruamel.yaml) emits 2-space-indented sequences. Cross-path writes to the same config.yaml toggled indentation on every save, eventually producing a mixed-indent file that js-yaml rejects with 'bad indentation of a mapping entry', silently dropping custom_providers and breaking model switching. Add IndentDumper SafeDumper subclass that forces indentless=False, route atomic_yaml_write through it. Route tui_gateway._save_cfg and the Telegram adapter's config writer through atomic_yaml_write so all paths emit the same 2-indent layout. Salvaged from #32034 by @xxxigm. Adapted to current main which already has allow_unicode=True (from #51356) but was missing IndentDumper. Closes #31999	2026-06-25 23:27:44 +05:30
brooklyn!	a53fc78c02	Merge pull request #52594 from NousResearch/bb/queue-resubmit-on-busy fix(tui_gateway): queue mid-turn prompts instead of dropping them on a busy retry	2026-06-25 12:50:18 -05:00
xxxigm	d93abd75d1	test(terminal): cover sudo cache invalidation and multi-invocation piping	2026-06-25 23:08:48 +05:30
brooklyn!	931a5e92cc	Merge pull request #52592 from NousResearch/bb/close-interrupt-tool-seq-sibling-paths fix(agent): close tool-call sequence on all interrupt aborts (#48879 follow-up)	2026-06-25 12:31:27 -05:00
Brooklyn Nicholson	70319626a9	fix(tui_gateway): queue mid-turn prompts instead of dropping them on a busy retry A prompt sent while a turn was in flight got rejected with 4009 "session busy", which pushed clients (the desktop app) into a deadline-bounded busy-retry. When turn teardown outlived that deadline — e.g. the user hits stop while a slow, non-interruptible tool (web_search, read_file, an MCP call) is mid-flight, since the sequential executor only checks the interrupt flag between tools — the resubmitted message was silently dropped: "it just doesn't listen". Wire the previously-dead display.busy_input_mode config into prompt.submit: instead of rejecting, apply the policy and queue the message to run as the next turn (drained in run()'s tail, ahead of goal/notification follow-ups). Modes: interrupt (default) interrupts the live turn so it winds down promptly then runs the queued message; queue runs it after the current turn finishes; steer injects it into the live turn when accepted, else queues. The queued slot pins the sender's transport and losslessly merges a second arrival. No client deadline, no dropped sends.	2026-06-25 12:29:49 -05:00
Brooklyn Nicholson	2d286a6d00	fix(agent): close tool-call sequence on all interrupt aborts, not just finalize_turn #48879 closed the tool-call sequence on interrupt inside finalize_turn so a /stop after a tool no longer persists a `tool` tail that the next user message turns into a `tool -> user` role-alternation violation (which strict providers like Gemini/Claude react to by hallucinating a continuation and ignoring prior context — what users see as "lost context after stop"). But the retry-wait, error-handling, and post-error retry-wait interrupt aborts in conversation_loop return early and never reach finalize_turn, so they still persisted and returned a raw `tool` tail. Interrupting during provider backoff/rate-limiting (common under heavy work) hit exactly this path. Extract the close into a shared close_interrupted_tool_sequence helper and apply it at every interrupt abort (finalize_turn + the three early returns) so the whole bug class is fixed, not just the one site.	2026-06-25 12:24:34 -05:00
Brooklyn Nicholson	1d9ed7f48a	fix(desktop): ad-hoc sign macOS self-update rebuilds The desktop self-updater rebuilds and re-signs the .app on each user's own machine (`hermes desktop --build-only` -> electron-builder `--dir`). With CSC_IDENTITY_AUTO_DISCOVERY on (its default), electron-builder signs the type=distribution, hardened-runtime bundle with whatever identity is in that user's keychain -- typically a personal "Apple Development" cert -- which stalls/fails the sign step (no Developer ID, no provisioning profile) or clobbers the original notarized signature with an unusable one, tripping Gatekeeper on every post-update launch. Force ad-hoc signing for the local packaged rebuild instead: deterministic, and exactly what _desktop_macos_relaunchable_fixup already finishes off. No-op for source runs, off-macOS, when a real identity is configured (CSC_LINK / APPLE_SIGNING_IDENTITY), or when the caller already pinned the flag.	2026-06-25 12:08:29 -05:00
Ben Barclay	d6269da7fd	fix(gateway): harden scale-to-zero dormancy guards (#52359) Block scale-to-zero suspend while background async delegations are active, and restore runtime status to running on real inbound after a dormant wake. Add regression coverage for both review findings.	2026-06-25 20:41:03 +10:00
Teknium	e62afaca62	fix(learn): teach /learn the full CONTRIBUTING.md skill standards (#52372) The /learn authoring prompt taught a subset of the HARDLINE skill rules, and stated the <=60-char description rule without making the model enforce it — so generated descriptions overshot (up to 202 chars), which the 60-char system-prompt skill index then silently truncates. - description: add the index-truncation rationale, a count-and-trim self-check, and a good/bad length example so the model actually hits <=60. - add platforms-gating rule (OS-bound primitives -> declare platforms:). - add author-credits-human-first rule. - round out the Hermes-tool framing with the full wrapped-tool mapping and references/templates layout. Closes #52367.	2026-06-25 00:17:23 -07:00
benbenwyb	6f2b2a1f34	fix: handle named custom providers and Z.AI overload retries	2026-06-25 00:17:17 -07:00
Ben Barclay	736e981abf	fix(auth): honor NOUS_INFERENCE_BASE_URL env override for Nous OAuth sessions (#52270) The host-allowlist hardening (#30611) plus the refresh heal (#49735) left the documented NOUS_INFERENCE_BASE_URL dev/staging escape hatch unreachable for OAuth sessions, despite three code comments asserting it still works. Root cause — resolution precedence in resolve_nous_runtime_credentials: inference_base_url = ( _optional_base_url(state.get("inference_base_url")) # stored — wins or os.getenv("NOUS_INFERENCE_BASE_URL") # env — unreachable or DEFAULT_NOUS_INFERENCE_URL ) A staging OAuth login persists its inference_base_url, but the allowlist rejects the staging host and the refresh heal rewrites the stored value to the production default. The stored (now prod) value is then read BEFORE the env var, so the override never takes effect — every request 401s against prod or is pinned to prod, and setting the env var does nothing. Fix: the user-set env override is the most-trusted source, so consult it FIRST for the URL used to build the client / returned to callers — while keeping the PERSISTED value the validated, network-provenance one (the override is a runtime overlay, never written to auth.json, so unsetting it cleanly reverts to prod). Applied at both chokepoints: - resolve_nous_runtime_credentials (no-refresh read path AND refresh path) - the nous_portal proxy adapter, which re-validates the resolver's returned base_url against the prod allowlist as defense-in-depth and would otherwise reject a legitimate staging override at the forward boundary. New _nous_inference_env_override() / split of stored-vs-effective URL keep the threat model intact: Portal-returned URLs are still allowlist-validated at every network site, and the env path stays ungated (trusted OS user). Also folds in the no-refresh read-path heal (supersedes the approach in the open #50265): a poisoned stored staging host now heals to the prod default on read even when no refresh fires. Tests: TestEnvOverrideWins (env wins on read + refresh paths; override never persisted; poisoned stored heals) and TestProxyAdapterEnvOverride. Verified the 4 behavioral tests fail against pre-fix code and pass with the fix; full inference-validation + nous-provider suites green (85 passed). E2E-validated against a real temp HERMES_HOME exercising the real resolver + proxy adapter: resolver→staging, persisted→prod, proxy→staging, unset→reverts to prod.	2026-06-25 00:11:15 -07:00
kshitijk4poor	d6cf383d74	refactor(setup): simplify Z.AI picker — drop dead fallback, fix tests - Remove dead `chosen_base or effective_base` fallback; _select_zai_endpoint always returns a non-empty base URL (returns current_base on cancel). - Add .rstrip("/") to official-endpoint return for symmetry with custom-proxy path (both now return normalized URLs). - Replace magic index 4 with len(ZAI_ENDPOINTS) in custom-proxy tests so they don't break if a 5th endpoint is added to ZAI_ENDPOINTS.	2026-06-25 12:07:01 +05:30
kshitijk4poor	d0df264213	test(setup): add ZAI endpoint picker tests, move base-URL tests to MiniMax Z.AI now uses a curses picker instead of plain text input for base URL, so the existing TestBaseUrlValidation tests (which used zai as their test subject) are migrated to MiniMax, which still uses the text input path. Add TestZaiEndpointPicker covering: - Selecting each official endpoint (Global, China, Coding Plan Global, Coding Plan China) saves the correct base URL to config - Custom proxy URL entry (valid + invalid rejection) - Cancel keeps the existing base URL - Current endpoint is the default choice in the picker - Non-standard URL defaults to the Custom proxy option	2026-06-25 12:07:01 +05:30
Brooklyn Nicholson	f3d6d9bbd3	fix(ui): share compact tool previews across clients Move terminal/execute_code/read_file preview compaction into agent.display so CLI, gateway, and Ink TUI all inherit the same labels that desktop introduced in #52321. The shared preview keeps raw args intact while trimming display-only shell plumbing (`cd`, pipe tails, banner/status echoes) and read_file line ranges. Desktop now prefers backend `context` for live rows and keeps its TypeScript fallback only for hydrated history.	2026-06-25 00:47:14 -05:00
Brooklyn Nicholson	a5849917a8	test(pets): make slow pet generation suite opt-in The pet generation image-processing suite is deterministic but expensive enough to blow the per-file CI timeout on Linux (140s), and it is not relevant to the fast timeout PR's normal signal. Keep it available for manual validation, but do not run it by default. Set HERMES_RUN_SLOW_PET_TESTS=1 to enable the suite. The canonical test wrapper now preserves that opt-in variable through its hermetic env.	2026-06-25 00:44:53 -05:00
Teknium	7a65800fed	fix(cache): content-address prompt_cache_key so recurring cron jobs reuse the warm prefix (#52295) Recurring cron jobs were prompt-cache-cold on every fire. session_id is built as cron_<job_id>_<timestamp>, and the Codex/Responses transport used session_id directly as prompt_cache_key — so the timestamp changed the cache key on every run and the static prefix (agent identity + tool schemas) was re-paid each tick. Derive prompt_cache_key from a SHA-256 of the static prefix (instructions + sorted tool schemas) instead. Repeated fires of the same job share one content-addressed key (pck_<hash>) and reuse the warm prefix within the provider's cache TTL. The key changes exactly when the prefix changes — edit the job's prompt or toolset and it re-keys; leave it alone and it stays stable. session_id is left untouched for transcript isolation, log correlation, and the Codex/xAI session-scope routing headers (session_id, x-client-request-id, x-grok-conv-id) — those are the per-fire identity, not the cache key. Only the prompt_cache_key body field (standard OpenAI/Codex path and the xAI extra_body field) is content-addressed. Closes #51395. Co-authored-by: spiky02plateau <spiky02plateau@users.noreply.github.com> Co-authored-by: JoaoMarcos44 <JoaoMarcos44@users.noreply.github.com>	2026-06-24 21:46:30 -07:00
Ben Barclay	72ae163250	fix(relay): authorize relay-delivered events by delivery, not source.platform (#52306) * fix(relay): authorize relay-delivered events by delivery, not source.platform The #52190 upstream-authz fix keyed _is_user_authorized off source.platform via _adapter_authorization_is_upstream(source.platform). But a relay message inbound carries the UNDERLYING platform (source.platform == discord/telegram/...), NOT Platform.RELAY, because ws_transport._event_from_wire maps the connector's wire payload (platform="discord") straight onto SessionSource for session-keying and egress. The relay adapter is registered only under Platform.RELAY, so adapters.get(Platform.DISCORD) misses, the trusted-upstream branch is skipped, and the user hits the env-allowlist default-deny: WARNING gateway.run: Unauthorized user: <id> (<name>) on discord (Live staging bug: alpha tester linked successfully, then every follow-up DM was silently dropped.) Fix: the authentic trust signal is that the event was delivered over the per-instance-authenticated relay WS, not which platform it underlies. Add a wire-INVISIBLE SessionSource.delivered_via_upstream_relay flag, stamped by the relay transport in _event_from_wire, and authorize on it. The flag is excluded from to_dict/from_dict so a peer can neither forge it across the wire nor have it restored from persistence. The existing adapter-flag check is retained for events whose source.platform IS Platform.RELAY (interaction-passthrough). A direct Discord event on a multiplexing gateway (direct + relay adapters) is unmarked and still default-denies. * fix(relay): use identity check on delivery marker to avoid MagicMock fail-open A MagicMock() source (used by test_signal.py and other gateway tests) auto-vivifies source.delivered_via_upstream_relay as a truthy Mock, which a bare truthiness check would treat as authorized — flipping test_signal_in_allowlist_maps from False to True. The marker is a real bool on SessionSource, so check 'is True' explicitly: refuses to authorize any non-bool stand-in, defensive against accidental fail-open.	2026-06-25 14:21:09 +10:00
brooklyn!	0c442fa1d3	Merge pull request #52303 from NousResearch/bb/pets-gen-qa feat(pets): quality-first OpenRouter chain, stronger atlas gates, global pet-gen notifications	2026-06-24 23:16:40 -05:00
Brooklyn Nicholson	e92b5c6af8	feat(pets): quality-first OpenRouter model chain + stronger atlas gates + global pet-gen notifications OpenRouter/Nous image gen now runs a quality-first model chain by default: attempt the highest-fidelity OpenAI image model first, then fall back to Gemini 3 Pro Image when it's access-gated/unavailable/times out. An explicit OPENROUTER_IMAGE_MODEL / config model override pins one model with no fallback. Atlas validation rejects malformed model output instead of shipping it: adds a per-state collapse guard (a single sliver/fragment row no longer passes because other rows are healthy), on top of the existing postage-stamp + multi-pose checks. Desktop: pet-gen native notifications are now "global" (not tied to a chat session), so a background generation started from the command center fires an OS notification when the user is away even with no active session. Adds a neutral "This can take up to 5 minutes." banner on step 1, and lets the provider picker auto-size. Tests updated/added for the OpenRouter fallback chain, the collapse guard, and the global notification path.	2026-06-24 23:11:21 -05:00
brooklyn!	380d660cab	Merge pull request #52297 from NousResearch/bb/ad-hoc-verify Support ad-hoc verification scripts	2026-06-24 23:10:15 -05:00
brooklyn!	d473e5d07a	Merge pull request #52296 from NousResearch/bb/verify-stop-loop Add verification stop loop	2026-06-24 23:10:03 -05:00
brooklyn!	1512bad0bc	Merge pull request #52286 from NousResearch/bb/verify-status feat(gateway): expose coding verification status	2026-06-24 23:09:45 -05:00
brooklyn!	da0320bf40	Merge pull request #52285 from NousResearch/bb/verify-ledger feat(agent): record coding verification evidence	2026-06-24 23:07:10 -05:00
Brooklyn Nicholson	a5a2edd451	feat(agent): recognize focused ad-hoc verification scripts Allow focused temporary scripts to satisfy verification when no canonical suite is detected, while keeping suite evidence distinct from ad-hoc proof.	2026-06-24 23:03:45 -05:00
Brooklyn Nicholson	2f1a47b90e	feat(agent): require verification before finishing edits Make verification closure the default coding behavior after landed file edits while keeping bounded retries and config/env switches for users who need to disable it.	2026-06-24 23:02:48 -05:00
Brooklyn Nicholson	7ef0f360d0	feat(gateway): expose coding verification status Add a read-only gateway RPC for querying the passive verification ledger without running checks from the UI surface.	2026-06-24 22:36:03 -05:00
Brooklyn Nicholson	f0beb6f617	test(agent): cover verification evidence ledger Exercise command classification, session scoping, stale edits, bounded retention, and natural expiry for recorded verification evidence.	2026-06-24 22:35:27 -05:00
Victor Kyriazakos	b177d4ee48	fix(cron): mirror continuable cron as a labelled user turn (alternation-safe) Addresses review on #51077 (kxee). The continuable-cron mirror reused gateway.mirror.mirror_to_session, which writes role=assistant — re-introducing the exact alternation violation #2313 (`37a997945`) deliberately removed: a cron brief landing as assistant after the agent's last turn yields assistant->assistant, which breaks strict-alternation providers (OpenAI/OpenRouter) per issue #2221. The mirror/mirror_source metadata is also dropped at the SQLite boundary, so the [Delivered from cron] label is lost on replay. This is an intentional, opt-in (default OFF) reversal of #2313's 'cron output does not belong in interactive history' for the reply-to-cron use case — gated behind cron.mirror_delivery / attach_to_session. Fixes: - mirror_to_session gains a role param (default 'assistant' — interactive send_message mirror unchanged, it IS the agent speaking). Cron paths pass role='user' with a '[Cron delivery: <task>]' prefix so the brief collapses via repair_message_sequence's consecutive-user merge on every provider, and stays distinguishable on replay despite the metadata drop. - thread_seeded: defer seeding + the flag until delivery into the new thread actually succeeds. Previously set pre-delivery, so an open-succeeds / deliver-fails case both stranded a seeded-but-unseen brief AND suppressed the DM-fallback mirror. - seed mirror now passes user_id='system:cron' to resolve the exact thread-keyed session row it just created. - dedupe the duplicate BasePlatformAdapter import in _deliver_result. - trim oversized docstrings to non-obvious WHY (AGENTS.md). - docs: document cron.mirror_delivery / attach_to_session in website/docs/user-guide/features/cron.md. - test: assert the cron mirror writes role='user' with the label prefix. 204 cron+mirror tests pass.	2026-06-24 20:27:05 -07:00
Victor Kyriazakos	b693bee100	feat(cron): thread-preferred continuable delivery (open a thread, mirror DM fallback) Continuable cron jobs (attach_to_session / cron.mirror_delivery, default OFF) now prefer a dedicated thread on thread-capable platforms, falling back to origin-DM mirroring where threads don't exist. - Thread-capable (Telegram topics, Discord/Slack threads): open a fresh thread for the job via the shipped adapter.create_handoff_thread, route the brief into it, and seed the thread-keyed session so the user's in-thread reply continues with full context. This is the 'continuable cron opens its own thread' interface. - DM-only (WhatsApp/Signal/SMS): create_handoff_thread returns None -> fall back to mirroring into the origin DM session (existing behaviour). Reuses existing infrastructure end-to-end — no new adapter surface, no provider-chain signature change: - adapter.create_handoff_thread (already implemented per-platform, returns None on unsupported platforms = the fallback signal) - the live SessionStore via adapter._session_store (already set on every adapter), reached without threading a new param through the frozen CronScheduler.start() contract - gateway.mirror.mirror_to_session for the seed/append - existing per-target delivery routing carries the new thread_id for free Mirrors GatewayRunner._process_handoff's open-thread-or-fallback + seed pattern, standalone for the cron delivery path. thread_seeded guards against a double-mirror after seeding. Scoped to the origin target only; fan-out/broadcast targets are never threaded or mirrored. Config docs updated (cron.mirror_delivery) + cronjob tool attach_to_session description reframed around continuable/thread-preferred. Tests: +5 (thread id returned on thread platform; None on DM platform; None without capability/loop; seed creates thread session + mirrors; seed no-op on empty). 22/22 in TestCronDeliveryMirror; 532 cron tests pass (4 failures pre-existing: croniter-not-installed + TZ).	2026-06-24 20:27:05 -07:00
Victor Kyriazakos	98f3c19282	feat(cron): pass origin user_id to delivery mirror (send_message parity) Multi-participant parity with interactive send_message, which passes HERMES_SESSION_USER_ID to gateway.mirror.mirror_to_session so the mirror lands in the exact participant's session. - cronjob_tools._origin_from_env now captures user_id from the session context at job-create time (alongside platform/chat_id/thread_id). - _maybe_mirror_cron_delivery forwards user_id to mirror_to_session. - _deliver_result threads origin.user_id through for the origin target. Effect: in a per-user-isolated group chat (group_sessions_per_user=True, the default), the mirror resolves to the member who scheduled the job instead of conservatively no-op'ing on ambiguous candidates. DMs and shared group/thread sessions are unaffected (single candidate). Default still OFF. Tests: helper forwards user_id; E2E _deliver_result forwards origin user_id. 17/17 in TestCronDeliveryMirror; 527 cron tests pass (4 failures pre-existing: croniter-not-installed + TZ, identical on baseline).	2026-06-24 20:27:05 -07:00
Victor Kyriazakos	c06ceb3232	refactor(cron): scope delivery mirror to the origin conversation The cron->session mirror now fires ONLY for the delivery target that equals the job's origin (platform+chat_id[+thread_id]). A job created from a live gateway chat stamps that chat as origin, and that session is guaranteed to exist (it is the conversation the user scheduled the job in). Fan-out / broadcast / home-channel-fallback targets are never mirrored: they are not a continuation of a conversation and may have no session at all. This makes the prior 'cold-start session seeding' concern a non-case by construction: when the mirror semantically applies the session exists; when none exists the target was never the origin, so we no-op. Adds _target_matches_origin() + origin-scoping tests (exact match, other-chat/other-platform/no-origin rejection, thread scoping, fan-out mirrors only the origin target).	2026-06-24 20:27:05 -07:00
Victor Kyriazakos	1b181724fa	feat(cron): optional mirror of cron delivery into target chat session Adds an opt-in path so a cron job's delivered output is also appended to the TARGET chat's gateway session transcript (as an assistant turn), so a user reply to a recurring delivery (daily brief, reminder) is answered with the delivery in context instead of 'what is that?' amnesia. - Reuses the shipped gateway.mirror.mirror_to_session — the same primitive interactive send_message mirroring already uses. No messaging-toolset change (cron still can't call send_message; this rides delivery). - Gated: per-job attach_to_session overrides global cron.mirror_delivery (config.yaml). Default OFF — historical isolation preserved byte-for-byte. - Mirrors the CLEAN agent output, not the cron header/footer wrapper. - Alternation/cache-safe: append lands at a turn boundary, never mid-loop, never mutates the cached system prompt. Cold-start (no target session) is a silent no-op; mirror errors never fail a successful delivery. - Surfaced on the cronjob tool (attach_to_session) + config schema. Driven by enterprise cron-as-control-plane use case. 10 new tests; full cron + cronjob-tool suites pass (600).	2026-06-24 20:27:05 -07:00
Ben	0c3f197cff	fix(relay): re-attach DM author user_id on outbound for connector egress A DM reply carries no guild_id, so the connector's egress guard cannot resolve the owning tenant from metadata.guild_id and declines the send with "discord egress declined: target not routed to an onboarded tenant" — the bug behind "the bot never replies in DMs". Guild replies are unaffected (they carry guild_id), which is why the guild path worked end-to-end while DMs looked broken. The connector now resolves a DM reply's tenant from the recipient's author binding (gateway-gateway #67, resolveByUser keyed on metadata.user_id) — the outbound counterpart to inbound Phase 7a author-first resolution. But it needs the recipient user_id ON the outbound action, and the adapter only re-attached guild_id (_capture_scope/_with_scope), no-op for DMs (the docstring even said so). This extends the adapter's inbound-scope capture: for a DM (no guild_id) remember chat_id -> the authentic author user_id we observed, and re-attach it as metadata.user_id on outbound. Guild capture is unchanged and wins when present; user_id is the DM-only fallback. The id is the one the connector observed inbound (never gateway-asserted), so the trust invariant holds. +4 unit tests (DM reply re-attaches user_id + no guild_id; unknown chat invents nothing; explicit user_id preserved; guild reply never carries user_id). Proved load-bearing (reverting the re-attach fails the DM test). 144 relay tests pass, ruff clean. Pairs with gateway-gateway #67 (the connector-side resolver). Together they close the DM-reply egress gap end-to-end.	2026-06-25 12:43:54 +10:00
Ben Barclay	c15945655f	fix(terminal): sanitize host/relative cwd OVERRIDE before it reaches docker run -w (#50636) terminal_tool() resolves a per-task cwd override that WINS over config["cwd"]: cwd = overrides.get("cwd") or config["cwd"] config["cwd"] is sanitized for container backends in _get_env_config() (host prefixes /Users//home//C:\\/C:/ and relative paths are replaced with the backend default /root). But the override was applied RAW — it was never run through that guard. The gateway/TUI registers the host launch dir as a cwd override for workspace tracking (tui_gateway/server.py _register_session_cwd -> _terminal_task_cwd -> _session_cwd -> os.getcwd()), so on a container backend a host path leaked straight to `docker run -w <host-path>`: - Windows desktop: -w C:\Users\<user> -> container fails to start (exit 125) - POSIX: -w /home/<user> -> same The ACP adapter translates its override cwd (acp_adapter/session.py _translate_acp_cwd), but the gateway path did neither translation nor sanitization, so the override bypassed the one guard that would have caught it. Fix: extract the host/relative-path predicate into a shared _is_unusable_container_cwd() helper (so the existing _get_env_config() sanitizer and the new guard can't drift), and re-apply it to the resolved cwd at the override-resolution site. Valid in-container override paths (RL/benchmark sandboxes that set cwd to /workspace, /root, ...) are absolute non-host paths and pass through untouched. Tests: unit-pin the predicate (Windows backslash/forwardslash, POSIX home, macOS /Users, relative, valid container paths) AND an E2E call-site pin that drives terminal_tool() with a host-path override registered and asserts the cwd reaching _create_environment is sanitized. Mutation-verified: reverting the call-site guard makes the two host-path E2E tests fail (showing the raw host path leaking) while the valid-/workspace-override test stays green.	2026-06-25 02:33:40 +00:00
Teknium	411faf08bd	fix(soul): installers seed the real default persona, upgrade legacy empty templates (#52246) The desktop bootstrap (and curl/PowerShell/docker installs) seeded ~/.hermes/SOUL.md with a comment-only scaffold that contained no persona text. That shadowed the runtime default (_ensure_default_soul_md -> DEFAULT_SOUL_MD), since seeding is guarded by 'if SOUL.md doesn't exist'. Result: every fresh installer install got the empty template instead of the documented Hermes persona; desktop just made it visible in onboarding. - install.sh / install.ps1 / docker/SOUL.md now write DEFAULT_SOUL_MD. - _ensure_default_soul_md() upgrades a SOUL.md still matching the known legacy scaffold in place; customized files (any deviation, incl. a persona appended below the comment) are never touched. - Detection normalizes CRLF/BOM so Windows-installer drift still matches.	2026-06-24 18:56:26 -07:00
Teknium	a4fa1481e2	fix(tui): route /learn through command.dispatch so the prompt fires (#52232) The Desktop GUI (tui_gateway) slash worker subprocess has no reader for the CLI's _pending_input queue. /learn's CLI handler prints the ack and puts the built prompt onto that queue, so in the TUI the prompt was silently dropped — ack shown, no LLM turn, no skill created (#51829). command.dispatch already handles 'learn' correctly (returns {type: send, message: build_learn_prompt(arg)}), but 'learn' was missing from _PENDING_INPUT_COMMANDS, so slash.exec fell through to the worker instead of routing to command.dispatch. Add it to the frozenset, matching the existing goal/queue/steer/plan pattern.	2026-06-24 18:48:50 -07:00
Ben	d1cac0e5ef	feat(gateway): scale-to-zero idle detection + dormant-quiesce (Phase 0) The gateway-side BEHAVIOUR layer that consumes the relay scale-to-zero primitives (gateway-gateway Phase 5): the gateway decides it is idle and drives the relay transport dormant so the platform (Fly autostop:"suspend") can suspend the now-traffic-idle machine, which wakes on the connector's wakeUrl poke (decisions.md Q3=C', D1-D13). - gateway/scale_to_zero.py: pure helpers — scale_to_zero_enabled (the NAS Labs HERMES_SCALE_TO_ZERO stamp, D11/Q8=A), parse_idle_timeout_seconds (config.yaml gateway.scale_to_zero.idle_timeout_minutes, D2), messaging_is_relay_only_or_absent (F6/D1), should_arm (D1/D11/§3.4(1)), is_idle (D2/D3/F7). - gateway/run.py: _last_inbound_at clock stamped on user inbound in _handle_message (F13); the arm-gate + idle predicate + the _scale_to_zero_watcher dormant sequence (mark draining -> adapter go_dormant() -> cooldown), started only when armed. Deliberately NOT the stop path and NOT mark_resume_pending (F12/D13). - tools/process_registry.py: has_any_active() for the bg-work guard (D3/F7). - hermes_cli/config.py: gateway.scale_to_zero.idle_timeout_minutes default 5. Tests: 38 pure-logic + 6 watcher (incl. bg-work regression guard proven RED). Full relay + scale-to-zero suites: 184 passed. The 20 unrelated failures in the broader run are PRE-EXISTING on origin/main (custom-provider/tools tests), confirmed via a pristine baseline worktree.	2026-06-24 18:47:18 -07:00
Ben	96af4bec30	feat(relay): add go_dormant() transport mode for scale-to-zero (0.E0) Net-new WebSocketRelayTransport.go_dormant() + RelayAdapter.go_dormant() — the third transport mode the scale-to-zero behaviour layer needs, distinct from both disconnect() and an unexpected close (decisions.md D12/F14): - disconnect() sets _closing=True and CANCELS the reconnect supervisor (terminal "shutting down for good") -> a suspended machine never re-dials on wake, stranding its buffered backlog. - an unexpected close re-dials IMMEDIATELY -> the socket never stays down, so the platform proxy never suspends the machine. go_dormant(): going_idle->ack (reuse go_idle), then close the socket WITHOUT setting _closing, so the reader's fall-through still arms the reconnect supervisor (wake path stays live) but on the longer _dormant_redial_s cadence so it doesn't fight the platform suspend window. A successful re-dial clears _dormant. Honors the §3.4 wake->reconnect->drain contract. Tests: 6 new in test_relay_going_idle.py incl. the F14 regression guard (routing dormancy through disconnect() fails exactly the 4 wake-path tests). Full relay suite 140 passed.	2026-06-24 18:47:18 -07:00
helix4u	17beb55e3c	fix(telegram): gate rich draft previews separately	2026-06-24 18:11:14 -07:00
brooklyn!	7157b213f5	Merge pull request #47959 from NousResearch/bb/pets-gen Pet generation: frame-perfect hatch flow, backend picker, CPU-safe chroma, and CI-hardening	2026-06-24 19:41:34 -05:00
Brooklyn Nicholson	a05a9b0e07	test(delegate): harden heartbeat in-tool stale timing assertion Stabilize the long-running-tool heartbeat test by patching stale thresholds inside the test and asserting the heartbeat exceeds the idle ceiling, which preserves intent while removing scheduler-sensitive assumptions that flake in CI.	2026-06-24 19:33:40 -05:00
brooklyn!	b649cdee4a	Merge pull request #52203 from NousResearch/bb/update-drain-announce fix(update): announce gateway drain waits so desktop updates don't look hung	2026-06-24 19:28:44 -05:00
Ben	538c419d2e	fix(gateway): scope dashboard liveness fallback to the profile PR #52151 hardened the runtime-status liveness check to trust a readable live process command line over stale gateway_state.json argv, so a recycled PID now owned by an s6 supervisor no longer counts as a running gateway. That fix is correct but incomplete for the reported symptom: the web dashboard showed a named profile's gateway green while `hermes -p <name> gateway status` showed it stopped. Two further issues: 1. Cross-profile PID reuse. In per-profile Docker supervision, one profile's stale `gateway_state.json` can record a PID the OS later recycled onto a DIFFERENT profile's live gateway. That PID's command line still `looks_like_gateway`, so the dead profile was reported running. The recorded argv has its `-p <name>` selector stripped in-process by `_apply_profile_override`, so it cannot disambiguate; the live `/proc` cmdline still carries it. `get_runtime_status_running_pid` now accepts an `expected_home` and validates the live command line belongs to THAT profile (mirroring `hermes_cli.gateway._matches_current_profile`, the logic the CLI scan path already uses — which is why the CLI was correct). `_check_gateway_running` passes the enumerated profile dir. 2. The existing regression test `test_gateway_running_check_falls_back_to_runtime_state` used the live pytest PID with a gateway-shaped record; once the live cmdline became authoritative it no longer looked like a gateway. Updated to mock the live cmdline to the real separate-process scenario it describes. The active-profile path (`get_running_pid`) is intentionally left unscoped: it is lock-verified and any live gateway cmdline is acceptable there. Multiplex mode is unaffected — `running` state is only ever written to a gateway's own home, never a secondary served profile's.
Adds coverage for: cross-profile PID reuse (named + default), matching profile cmdline (`-p`, `--profile`, explicit HERMES_HOME=), the bare default gateway, and the unreadable-cmdline cross-platform fallback. Each new cross-profile assertion fails without the profile scope and passes with it. Co-authored-by: helix4u <4317663+helix4u@users.noreply.github.com>	2026-06-25 10:25:54 +10:00
helix4u	f1617a7ebb	fix(gateway): validate runtime status pid command line	2026-06-25 10:25:54 +10:00
AIalliAI	463bf2be25	fix(update): announce gateway drain waits so desktop updates don't look hung On macOS, the desktop updater's stage 1 (hermes update --gateway) ends by restarting running gateways. launchd_restart() SIGTERMs the gateway and silently waits up to agent.restart_drain_timeout (default 180s) for the drain; the manual profile-gateway loop waits its drain budget per gateway the same way. Neither path prints anything before the wait, so the desktop updater's live output goes dead for minutes right after '✓ Update complete!' — users read it as a hung update and force-kill their gateway processes to make it move (#44515). The systemd branch already announces its drain ('draining (up to Ns)...'); launchd and the manual loop did not. Print the stop/drain (with PID and budget) before the wait in both paths, mirroring the systemd branch, and assert the message in the existing launchd drain test. Fixes #44515	2026-06-24 19:12:44 -05:00
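
The cwd guard described in commit c15945655f can be pictured with a small sketch. This is an illustration only, not the project's code: the function names below are hypothetical stand-ins for `_is_unusable_container_cwd()` and the override-resolution site, and matching any drive letter (rather than only `C:`) is an assumption. The host prefixes and the `/root` default are taken from the commit message.

```python
import re

# Hypothetical stand-ins; only the prefixes and the /root default come from
# the commit message. Matching every drive letter is an assumption.
_HOST_PREFIXES = ("/Users/", "/home/")
_WINDOWS_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")
CONTAINER_DEFAULT_CWD = "/root"


def is_unusable_container_cwd(cwd: str) -> bool:
    """Return True when cwd looks like a host path or a relative path."""
    if not cwd:
        return True
    if _WINDOWS_DRIVE.match(cwd):
        return True  # e.g. C:\Users\bob or C:/Users/bob
    if not cwd.startswith("/"):
        return True  # relative path
    return cwd.startswith(_HOST_PREFIXES)


def resolve_container_cwd(overrides: dict, config: dict) -> str:
    """Apply the guard to the resolved cwd, whether it came from the override or the config."""
    cwd = overrides.get("cwd") or config["cwd"]
    return CONTAINER_DEFAULT_CWD if is_unusable_container_cwd(cwd) else cwd
```

Running the predicate on the already-resolved value, rather than on `config["cwd"]` alone, is the point of the fix: a per-task override cannot skip the check, while an in-container path such as `/workspace` passes through unchanged.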

6197 commits
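
The legacy-template detection described in commit 411faf08bd (upgrade a SOUL.md only when it still matches the known scaffold, after normalizing CRLF and BOM) might look roughly like the sketch below. The scaffold and persona strings are placeholders, and the helper names are hypothetical; only the name `DEFAULT_SOUL_MD` and the normalization idea come from the commit message.

```python
# Placeholder contents: the real scaffold and persona text live in the repo.
LEGACY_SCAFFOLD = "<!-- Describe your agent's persona here. -->\n"
DEFAULT_SOUL_MD = "You are Hermes.\n"


def _normalize(text: str) -> str:
    """Drop a leading BOM and convert CRLF/CR line endings to LF."""
    return text.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")


def is_legacy_scaffold(text: str) -> bool:
    """True only when the file is the untouched scaffold, modulo BOM and line endings."""
    return _normalize(text) == _normalize(LEGACY_SCAFFOLD)


def upgrade_soul(text: str) -> str:
    """Return the default persona for an untouched scaffold; leave any customized file alone."""
    return DEFAULT_SOUL_MD if is_legacy_scaffold(text) else text
```

Comparing for exact equality after normalization keeps the upgrade conservative: a file with a persona appended below the comment no longer matches, so it is returned unchanged.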