Commit graph

6942 commits

Author SHA1 Message Date
teknium1
762eb79f1e fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451)
Two mitigations for the CLOSE_WAIT accumulation reported against QQ Bot
+ Feishu on macOS behind Cloudflare Warp.

1. Shared httpx.Limits helper (gateway/platforms/_http_client_limits.py).
   Every long-lived platform adapter now constructs httpx.AsyncClient
   with max_keepalive_connections=10 and keepalive_expiry=2.0, vs httpx's
   default of unbounded keepalive pool and 5.0s expiry. On macOS/Warp the
   default 5s window let idle keepalive sockets sit in CLOSE_WAIT long
   enough for seven persistent adapters (QQ Bot, WeCom, DingTalk, Signal,
   BlueBubbles, WeCom-callback, plus the transient Feishu helper) to
   compound to the 256-fd ulimit. Tunable via
   HERMES_GATEWAY_HTTPX_KEEPALIVE_EXPIRY and
   HERMES_GATEWAY_HTTPX_MAX_KEEPALIVE env vars.

2. whatsapp.send_typing aiohttp leak. The call was
   'await self._http_session.post(...)' with no 'async with' and no
   variable capture — the ClientResponse went out of scope unclosed,
   holding its TCP socket in CLOSE_WAIT until GC. Fixed by wrapping in
   'async with'. This was the only bare-await aiohttp leak in the
   gateway/tools/plugins tree per audit; all other aiohttp sites use
   the context-manager pattern correctly.

The underlying reporter also saw Feishu SDK (lark-oapi) connections in
CLOSE_WAIT — those are inside the SDK and out of our direct control, but
tightening httpx keepalive across adapters reduces the aggregate pool
pressure regardless of which individual adapter leaks.
2026-05-02 02:23:37 -07:00
beibi9966
38dd057e91 fix(feishu): finalize remote document downloads inside httpx.AsyncClient context (#18502)
Snapshot Content-Type and body while the client context is still
active so pooled connections fully release on exit. Previously the
read happened after `async with httpx.AsyncClient(...)` returned —
which works today only because httpx eagerly buffers non-streaming
responses; a future refactor to `.stream()` would silently read-
after-close.

Part of the #18451 connection-hygiene audit. Salvage of #18502.
2026-05-02 02:23:37 -07:00
Teknium
e444d8f29c
fix(gateway): config.yaml wins over .env for agent/display/timezone settings (#18764)
Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.
2026-05-02 02:14:35 -07:00
luyao618
13f344c5ce fix(agent): try fallback providers at init when primary credential pool is exhausted (#17929)
When a provider's credential pool has a single entry in 429-cooldown,
resolve_provider_client returns None and AIAgent.__init__ raises a
misleading RuntimeError suggesting the API key is missing — even when
valid fallback_providers are configured.

This patch makes __init__ iterate the fallback chain before raising,
mirroring the existing in-flight fallback logic in the request loop.
If a fallback resolves, the agent initializes against it and sets
_fallback_activated=True so _restore_primary_runtime can pick the
primary back up after cooldown.

Closes #17929
2026-05-02 02:09:46 -07:00
Teknium
1dce908930
fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log) (#18761)
* fix(gateway): config.yaml wins over .env for agent/display/timezone settings

Regression from the silent config→env bridge. The bridge at module import
time is correct for max_turns (unconditional overwrite), but every other
agent.*, display.*, timezone, and security bridge key was guarded by
'if X not in os.environ' — so a stale .env entry from an old 'hermes setup'
run would shadow the user's current config.yaml indefinitely.

Symptom: agent.max_turns: 500 in config.yaml, HERMES_MAX_ITERATIONS=60
in .env from an old setup, and the gateway silently capped at 60
iterations per turn. Gateway logs confirmed api_calls never exceeded 60.

Three changes:

1. gateway/run.py: drop the 'not in os.environ' guards for all agent.*,
   display.*, timezone, and security.* bridge keys. config.yaml is now
   authoritative for these settings — same semantics already in place
   for max_turns, terminal.*, and auxiliary.*. Also surface the bridge
   failure (previously 'except Exception: pass') to stderr so operators
   see bridge errors instead of silently falling back to .env.

2. gateway/run.py: INFO-log the resolved max_iterations at gateway
   start so operators can verify the config→env bridge did the right
   thing instead of chasing a phantom budget ceiling.

3. hermes_cli/setup.py: stop writing HERMES_MAX_ITERATIONS to .env in
   the setup wizard. config.yaml is the single source of truth. Also
   clean up any stale .env entry left behind by pre-fix setups.

Regression tests in tests/gateway/test_config_env_bridge_authority.py
guard each config→env key against the 'stale .env shadows config' bug.

* fix(gateway): shutdown + restart hygiene (drain timeout, false-fatal, success log)

Three issues observed in production gateway.log during a rapid restart
chain on 2026-05-02, all fixed here.

1. _send_restart_notification logged unconditional success
   adapter.send() catches provider errors (e.g. Telegram 'Chat not found')
   and returns SendResult(success=False); it never raises. The caller
   ignored the return value and always logged 'Sent restart notification
   to <chat>' at INFO, producing a misleading success line directly
   below the 'Failed to send Telegram message' traceback on every boot.
   Now inspects result.success and logs WARNING with the error otherwise.

2. WhatsApp bridge SIGTERM on shutdown classified as fatal error
   _check_managed_bridge_exit() saw the bridge's returncode -15 (our own
   SIGTERM from disconnect()) and fired the full fatal-error path,
   producing 'ERROR ... WhatsApp bridge process exited unexpectedly' plus
   'Fatal whatsapp adapter error (whatsapp_bridge_exited)' on every
   planned shutdown, immediately before the normal '✓ whatsapp
   disconnected'. Adds a _shutting_down flag that disconnect() sets
   before the terminate, and _check_managed_bridge_exit() returns None
   for returncode in {0, -2, -15} while shutting down. OOM-kill (137)
   and other non-signal exits still hit the fatal path.

3. restart_drain_timeout default 60s → 180s
   On 2026-05-02 01:43:27 a user /restart fired while three agents were
   mid-API-call (82s, 112s, 154s into their turns). The 60s drain budget
   expired and all three were force-interrupted. 180s covers realistic
   in-flight agent turns; users on very-long-reasoning models can still
   raise it further via agent.restart_drain_timeout in config.yaml.
   Existing explicit user values are preserved by deep-merge.

Tests
- tests/gateway/test_restart_notification.py: two new tests assert INFO
  is only logged on SendResult(success=True) and WARNING with the error
  string is logged on SendResult(success=False).
- tests/gateway/test_whatsapp_connect.py: parametrized test for
  returncode in {0, -2, -15} proves shutdown-time exits are suppressed;
  separate test proves returncode 137 (SIGKILL/OOM) still surfaces as
  fatal even when _shutting_down is set.
- _check_managed_bridge_exit() reads _shutting_down via getattr-with-
  default so existing _make_adapter() test helpers that bypass __init__
  (pitfall #17 in AGENTS.md) keep working unmodified.
2026-05-02 02:08:06 -07:00
teknium1
50f9f389ec chore(release): map ambition0802 email for AUTHOR_MAP
Follow-up for PR #17939 salvage.
2026-05-02 02:07:14 -07:00
ambition0802
7696ddc59e fix(cli): robust paste file expansion and process_loop error handling (#17666)
Two narrow fixes for long pasted messages silently disappearing:

1. _expand_paste_references: replace path.exists() + read_text() with
   try/except (OSError, IOError). Closes the TOCTOU window where a paste
   file deleted between check and read raised FileNotFoundError, bubbled
   up through process_loop's outer except, and silently dropped the
   user's input. Failures now return the placeholder text and log a
   warning.

2. process_loop outer except: logger.warning() instead of print().
   prompt_toolkit's TUI swallows stdout, so 'Error: …' was invisible
   to the user. Logged errors are discoverable via hermes logs.

Dropped the larger interrupt_queue→pending_input drain that was part of
the original PR — that's a separate class of input-drop (in-progress
interrupt handling) unrelated to the paste-file TOCTOU reported in the
issue, and worth its own review.

Salvage of #17939.
2026-05-02 02:07:14 -07:00
Teknium
5eac6084bc
fix(discord): warn on 32-char clamp collisions in the /skill collector (#18759)
Discord's per-command name limit is 32 chars. When two skill slugs
share the same first 32 chars (or a skill slug clamps onto a reserved
gateway command name), only the first seen wins — the second is
dropped from the /skill autocomplete. The old behavior incremented a
``hidden`` counter silently, so skill authors had no way to discover
the drop short of noticing their skill was missing from the picker.

Not an actively-biting bug today (no collisions on the default catalog
as of 2026-05), but a landmine the moment someone ships a skill with a
long name. The earlier series in #18745 / #18753 / #18754 dropped the
other silent data-loss paths in the Discord /skill collector; this one
lights up the last remaining one.

Fix: promote ``_names_used`` from a set to a dict keyed by the clamped
name, mapping to the source cmd_key (or a ``"<reserved>"`` sentinel
for names inherited via ``reserved_names``). On collision, log a
WARNING naming both sides — the winner, the loser, the clamped name,
and what to rename.

Two phrasings:

* skill-vs-skill — "both clamp to X on Discord's 32-char command-name
  limit; only the winner appears in /skill. Rename one skill's
  frontmatter ``name:`` to differ in its first 32 chars."
* skill-vs-reserved — "collides with a reserved gateway command name;
  the skill will not appear in /skill. Rename the skill's frontmatter
  ``name:``."

Tests: three cases in
``tests/hermes_cli/test_discord_skill_clamp_warning.py`` —
skill-vs-skill collision (warning names both cmd_keys + clamped prefix),
skill-vs-reserved collision (warning uses the distinct phrasing), and a
no-collision negative (zero warnings emitted).
2026-05-02 02:05:01 -07:00
teknium1
e363ced3c3 test(discord): regression coverage for zombie-websocket guard in connect()
Covers PR #18224 fix for issue #18187 — when DiscordAdapter.connect() is
called a second time without an intervening disconnect(), the previous
commands.Bot must be closed before a new one is created. Otherwise both
websockets stay connected to Discord's gateway and both fire on_message,
producing double responses with different wording.
2026-05-02 02:04:14 -07:00
luyao618
292d2fb42f fix(discord): close old client before reconnect to prevent zombie websockets (#18187)
When DiscordAdapter.connect() is called during reconnect, it creates a new
commands.Bot client without closing the previous one. The old client's
websocket remains connected to Discord's gateway, causing both to fire
on_message for every incoming event — resulting in double responses.

Fix: before creating a new Bot instance, check if a previous client exists
and close it. This ensures only one websocket connection is active at any
time.

Closes #18187
2026-05-02 02:04:14 -07:00
teknium1
0a6865b328 test(credential_pool): regression coverage for .env vs os.environ precedence
Covers PR #18256 fix for issue #18254 — when OPENROUTER_API_KEY is set in
BOTH os.environ (stale from parent shell) and ~/.hermes/.env (fresh),
_seed_from_env must prefer the .env value. Also guards the fallback case
where .env omits the key entirely (Docker/K8s/systemd deployments that
only inject via runtime env).
2026-05-02 02:00:32 -07:00
teknium1
9c626ef8ea chore(release): map franksong2702 email for AUTHOR_MAP
Follow-up for PR #18256 salvage.
2026-05-02 02:00:32 -07:00
Frank Song
2ef1ad280b fix: prefer ~/.hermes/.env over os.environ when seeding credential pool
When _seed_from_env() reads API keys to populate the credential pool, it
should treat ~/.hermes/.env as the authoritative source — not os.environ.
Stale env vars inherited from parent shell processes (Codex CLI, test
scripts, etc.) can shadow deliberate changes to the .env file, causing
auth.json to cache an outdated key that leads to silent 401 errors.

This is especially visible with OpenRouter: if a parent process exported
OPENROUTER_API_KEY=test-key-fresh and the user later updates .env with a
valid key, restarting Hermes still picks up the stale os.environ value,
writes it back to auth.json, and all API calls fail with 401.

Fixes #18254
2026-05-02 02:00:32 -07:00
Teknium
10297fa23c
fix(discord): /reload-skills now refreshes the /skill autocomplete live (#18754)
`_register_skill_group` captured the skill catalog in closure variables
(`entries` and `skill_lookup`) so the single `tree.add_command` call at
startup owned the only live copy. The closure is never re-entered after
startup, so `/reload-skills` — which rescans the on-disk skills dir and
refreshes the in-process `_skill_commands` registry — had no way to
propagate results into the `/skill` autocomplete on Discord. New skills
stayed invisible in the dropdown, and deleted skills returned
"Unknown skill" when the stale autocomplete entry was clicked.

The fix is purely a dataflow change: promote `entries` and `skill_lookup`
to instance attributes (`_skill_entries`, `_skill_lookup`), split the
collector-driven rebuild into a helper (`_refresh_skill_catalog_state`),
and add a public `refresh_skill_group()` method that re-runs the helper
and is safe to call at any point after the initial registration.

The gateway's `_handle_reload_skills_command` then iterates
`self.adapters` and calls `refresh_skill_group()` on any adapter that
exposes it (currently only Discord). Both sync and async implementations
are supported; adapters that don't override the method (Telegram's
BotCommand menu, Slack subcommand map, etc.) are silently skipped — the
in-process `reload_skills()` call covers them.

No `tree.sync()` is required because Discord fetches autocomplete
options dynamically on every keystroke — mutating the instance state the
callbacks already read from is sufficient. That sidesteps the per-app
command-bucket rate limit (~5 writes / 20 s) that made the previous
bulk-sync-on-reload approach unusable (#16713 context).

Tests: tests/gateway/test_reload_skills_discord_resync.py — five cases
covering (1) refresh replaces entries, (2) entries stay sorted after
refresh, (3) collector exception leaves cached state intact, (4)
`_refresh_skill_catalog_state` populates the instance attrs, (5)
orchestrator calls `refresh_skill_group()` on sync + async adapters and
skips adapters that don't expose it.
2026-05-02 02:00:11 -07:00
Teknium
6ec74aec07
fix(gateway): match disabled/optional skills by frontmatter slug, not dir name (#18753)
_check_unavailable_skill is meant to turn a typed "/foo" command that
doesn't resolve into a specific hint — "disabled, enable with hermes
skills config" or "available but not installed, install with hermes
skills install …" — instead of the generic "unknown command" reply.

It was doing the match with `skill_md.parent.name.lower().replace("_", "-")`,
comparing that to the typed command. For every skill whose directory name
drifted from its declared frontmatter `name:`, that comparison failed and
the user got the unhelpful generic path. On a standard install today 19
skills have this drift, e.g.:

  dir: mlops/stable-diffusion
  frontmatter: name: Stable Diffusion Image Generation
  registered slug (what the user types): /stable-diffusion-image-generation

  dir: mlops/qdrant
  frontmatter: name: Qdrant Vector Search
  registered slug: /qdrant-vector-search

  dir: mlops/flash-attention
  frontmatter: name: Optimizing Attention Flash
  registered slug: /optimizing-attention-flash

In every case, _check_unavailable_skill would fall through because
"stable-diffusion" != "stable-diffusion-image-generation", even with the
skill sitting right there on disk.

Fix: extract a small `_skill_slug_from_frontmatter` helper that reads the
SKILL.md frontmatter and normalizes exactly like scan_skill_commands
(lower, spaces/underscores → hyphens, strip non-[a-z0-9-], collapse
runs of hyphens, strip edges). Use it in both the
disabled-skills branch and the optional-skills branch. The disabled-set
membership check now uses the declared frontmatter name (which is what
`hermes skills config` writes into skills.disabled / platform_disabled),
not the slug.

Tests: five cases in tests/gateway/test_unavailable_skill_hint.py —
the drift case for the disabled branch, unknown-command negative,
matched-but-not-disabled negative, non-alnum stripping, and the drift
case for the optional-skills branch. All five fail against main and
pass with the fix.
2026-05-02 02:00:09 -07:00
Teknium
8825e9044c
fix(discord): complete #18741 for /skill autocomplete and drop legacy 25x25 caps (#18745)
``discord_skill_commands_by_category`` was lagging the flat
``discord_skill_commands`` collector on two counts. Both were actively
dropping skills from Discord's ``/skill`` autocomplete dropdown.

1. External-dir skills were filtered out. #18741 widened the flat
   collector to accept ``SKILLS_DIR + skills.external_dirs`` but left
   this sibling collector — the one ``_register_skill_group`` actually
   uses on Discord — still matching ``SKILLS_DIR`` only. External
   skills were visible in ``hermes skills list`` and the agent's
   ``/skill-name`` dispatch but silently absent from Discord's
   ``/skill`` picker. Widen the accepted roots to match, and derive
   categories from whichever root the skill lives under so
   ``<ext>/mlops/foo/SKILL.md`` still lands in the ``mlops`` group.

2. 25-group × 25-subcommand caps were still applied. PR #11580
   refactored ``/skill`` to a flat autocomplete (whose options Discord
   fetches dynamically — no per-command payload concern) and its
   docstring promises "no hidden skills." The collector kept the old
   nested-layout caps anyway, silently dropping anything past the 25th
   alphabetical category. On installs with 29 category dirs today (real
   example: tail categories ``social-media``, ``software-development``,
   ``yuanbao`` going missing) this was biting immediately. Remove the
   caps; ``hidden`` now reports only 32-char name-clamp collisions
   against reserved names.

Tests: guard both behaviors. ``test_no_legacy_25x25_cap`` builds 30
categories × 30 skills each and asserts all 900 are returned.
``test_external_dirs_skills_included`` monkeypatches
``get_external_skills_dirs`` and asserts an external-dir skill makes
it into the result grouped under its own top-level directory.
2026-05-02 02:00:06 -07:00
Jacob Lizarraga
2470434d60 fix(telegram): probe polling liveness after reconnect to detect wedged Updater
After a transient Telegram 502, _handle_polling_network_error's
stop()+start_polling() cycle can leave PTB's Updater with `running=True`
but a wedged consumer task that never makes progress. No error_callback
fires in that state, so the reconnect ladder never advances past attempt
1, the MAX_NETWORK_RETRIES fatal-error path is never reached, and the
gateway sits silent indefinitely.

Schedule a heartbeat probe (60s after a successful reconnect) that
verifies Updater.running is still True and bot.get_me() responds within
a tight asyncio.wait_for timeout. Either failure feeds back into the
reconnect ladder so the existing escalation path fires.

No PTB-internal coupling, no Application rebuild — minimal additive
defense inside the existing reconnect abstraction.

Tests cover healthy / Updater non-running / probe timeout / probe
network error / already-fatal cases, plus an integration check that the
probe is actually scheduled after a successful start_polling().

Closes the silent-wedge case observed in the wild after a transient
Telegram 502; existing reconnect tests updated to mock bot.get_me() now
that the success path schedules a heartbeat probe.
2026-05-02 01:55:04 -07:00
liuhao1024
9bf260472b fix(tools): deduplicate tool names at API boundary for Vertex/Azure/Bedrock
Providers like Google Vertex, Azure, and Amazon Bedrock reject API
requests with duplicate tool names (HTTP 400: 'Tool names must be
unique').  The upstream injection paths in run_agent.py already dedup
after PR #17335, but two API-boundary functions pass tools through
without checking:

- agent/auxiliary_client.py: _build_call_kwargs() (all non-Anthropic
  providers in chat_completions mode)
- agent/anthropic_adapter.py: convert_tools_to_anthropic() (Anthropic
  Messages API path)

Add defensive dedup guards at both sites.  Duplicates are dropped with
a warning log, converting a hard 400 failure into a recoverable
condition.  This is intentionally conservative — the root-cause dedup
in run_agent.py is the primary defense; these guards add resilience
against future injection-path regressions.

Includes 8 new tests covering unique passthrough, duplicate removal,
empty/None edge cases.

Closes #18478
2026-05-02 01:51:51 -07:00
Teknium
699b3679bc
fix(constants): warn once when get_hermes_home() falls back under an active profile (#18746)
When HERMES_HOME is unset but ~/.hermes/active_profile names a non-default
profile, any data this process writes lands in the default profile — not the
one the operator expects. Before this change the fallback was silent, so
cross-profile contamination (#18594) was invisible until a user noticed
their memory/state ended up in the wrong place.

Now we emit a one-shot warning to stderr the first time this happens in
a process. No raise — there are 30+ module-level callers of get_hermes_home()
and raising from any of them would brick import. Behavior is otherwise
unchanged; subprocess spawners (systemd template, kanban dispatcher, docker
entrypoint) already propagate HERMES_HOME correctly.

Bypasses logging.getLogger() because this runs before logging is configured
in a significant fraction of callers (module import time).

Refs #18594. Credit to @liuhao1024 for surfacing the silent-fallback case
in PR #18600; we kept the diagnostic signal without the import-time raise.
2026-05-02 01:49:55 -07:00
teknium1
98c98821ff chore(release): map CoreyNoDream email for AUTHOR_MAP
Follow-up for PR #18721 salvage.
2026-05-02 01:40:31 -07:00
CoreyNoDream
c5e3a6fb5b fix(cli): decode .env as UTF-8 to avoid GBK crash on Windows
Path.read_text() uses the system locale by default. On Windows CN/JP/KR
locales (GBK/CP932/CP949), reading a UTF-8 .env raises UnicodeDecodeError
as soon as it contains any non-ASCII byte (e.g. an em dash).

Pin encoding="utf-8" on every .env read in hermes_cli to match how the
rest of the codebase (load_dotenv at doctor.py:26) already decodes it.

Adds a regression test that monkeypatches Path.read_text to simulate a
GBK locale and asserts 'hermes doctor' no longer raises.

Refs #18637
2026-05-02 01:40:31 -07:00
Teknium
e2cea6eeba
fix(gateway): include external_dirs skills in Telegram/Discord slash commands (#18741)
Skills configured through `skills.external_dirs` in config.yaml were
visible via `hermes skills list`, `get_skill_commands()`, and the
agent's `/skill-name` dispatch, but silently excluded from the
Telegram and Discord slash-command menus. The filter in
`_collect_gateway_skill_entries` only accepted skills whose
`skill_md_path` started with `SKILLS_DIR`, so anything under an
external directory fell through.

Widen the accepted-prefix set to include all configured external
dirs alongside the local skills dir. Every prefix is now
slash-terminated so `/my-skills` cannot also admit
`/my-skills-extra`. Also guard against empty `skill_md_path`
values so they can't accidentally match.

Fixes #8110

Salvages #8790 by luyao618.

Co-authored-by: Yao <34041715+luyao618@users.noreply.github.com>
2026-05-02 01:36:57 -07:00
Teknium
c73594fe41
fix(skills): rescan skill_commands cache when platform scope changes (#18739)
The process-global `_skill_commands` dict in agent/skill_commands.py
was seeded by whichever platform scanned first, and
`get_skill_commands()` only rescanned when the cache was empty. In a
long-lived gateway process serving multiple platforms (Telegram +
Discord + Slack), the first platform's
`skills.platform_disabled` view was silently inherited by the
others — so a skill disabled for Telegram would also disappear from
Discord's slash menu, and vice versa.

Track the platform scope the cache was populated for
(`_skill_commands_platform`) and rescan in `get_skill_commands()`
when the currently-active platform no longer matches. Platform
resolution uses the same precedence as `_is_skill_disabled`:
`HERMES_PLATFORM` env var then `HERMES_SESSION_PLATFORM` from the
gateway session context.

Fixes #14536

Salvages #14570 by LeonSGP43.

Co-authored-by: LeonSGP <leon@sgp43.com>
2026-05-02 01:36:53 -07:00
Teknium
97acd66b4c
fix(curator): authoritative absorbed_into on delete + restore cron skill links on rollback (#18671) (#18731)
* fix(curator): authoritative absorbed_into declarations on skill delete

Closes #18671. The classification pipeline that feeds cron-ref rewriting
used to infer consolidation vs pruning from two brittle signals: the
curator model's post-hoc YAML summary block, and a substring heuristic
scanning other tool calls for the removed skill's name. Both miss in
real consolidations — the model forgets the YAML under reasoning
pressure, and the heuristic misses when the umbrella's patch content
describes the absorbed behavior abstractly instead of naming the old
slug. When both miss, the skill falls through to 'no-evidence fallback'
pruned, and #18253's cron rewriter drops the cron ref entirely instead
of mapping it to the umbrella. Same observable symptom as pre-#18253:
'Skill(s) not found and skipped' at the next cron run.

The fix makes the model declare intent at the moment of deletion.
skill_manage(action='delete') now accepts absorbed_into:
  - absorbed_into='<umbrella>'  -> consolidated, target must exist on disk
  - absorbed_into=''            -> explicit prune, no forwarding target
  - missing                     -> legacy path, falls through to heuristic/YAML

The curator reconciler reads these declarations off llm_meta.tool_calls
BEFORE either the YAML block or the substring heuristic. Declaration
wins. Fallback logic stays intact for backward compat with any caller
(human or older curator conversation) that doesn't populate the arg.

Changes
- tools/skill_manager_tool.py: add absorbed_into param to skill_manage
  + _delete_skill. Validate target exists when non-empty. Reject
  absorbed_into=<self>. Wire through dispatcher + registry + schema.
- agent/curator.py: new _extract_absorbed_into_declarations() walks
  tool calls for skill_manage(delete) with the arg. _reconcile_classification
  accepts absorbed_declarations= and treats them as authoritative. Curator
  prompt updated to require the arg on every delete.
- Tests: 7 new skill_manager tests covering the tool contract (valid
  target, empty string, nonexistent target, self-reference, whitespace,
  backward compat, dispatcher plumbing). 11 new curator tests covering
  the extractor + authoritative reconciler path + mixed-legacy-and-
  declared runs.

Validation
- 307/307 targeted tests pass (curator + cron + skill_manager suites).
- E2E #18671 repro: 3 narrow skills, 1 umbrella, cron job referencing
  all 3. Model emits NO YAML block. Heuristic misses (patch prose
  doesn't name old slugs). Delete calls carry absorbed_into. Result:
  both PR skills correctly classified 'consolidated' + cron rewritten
  ['pr-review-format', 'pr-review-checklist', 'stale-junk'] ->
  ['hermes-agent-dev']; stale-junk pruned via absorbed_into=''.
- E2E backward-compat: delete without absorbed_into, model emits YAML
  -> routed via existing 'model' source, cron still rewritten correctly.

* feat(curator): capture + restore cron skill links across snapshot/rollback

Before this, rolling back a curator run restored the skills tree but cron
jobs still pointed at the umbrella skills the curator had rewritten them
to. The user would see their old narrow skills back on disk but their
cron jobs still configured with the merged umbrella — not actually 'back
to how it was'.

Snapshot side: snapshot_skills() now captures ~/.hermes/cron/jobs.json
alongside the skills tarball, as cron-jobs.json. The manifest gets a new
'cron_jobs' block with {backed_up, jobs_count} so rollback (and the CLI
confirm dialog) can surface what's in the snapshot. If jobs.json is
missing/unreadable/malformed, snapshot proceeds without cron data — the
skills backup is the core guarantee; cron is additive.

Rollback side: after the skills extract succeeds, the new
_restore_cron_skill_links() reconciles the backed-up jobs into the live
jobs.json SURGICALLY. Only 'skills' and 'skill' fields are restored, and
only on jobs matched by id. Everything else about a cron job — schedule,
last_run_at, next_run_at, enabled, prompt, workdir, hooks — is live
state the user or scheduler has modified since the snapshot; overwriting
it would regress unrelated activity.

Reconciliation rules:
- Job in backup AND live, skills differ  → skills restored.
- Job in backup AND live, skills match   → no-op.
- Job in backup, NOT in live             → skipped (user deleted it
                                              after snapshot; their choice
                                              is later than the snapshot).
- Job in live, NOT in backup             → untouched (user created it
                                              after snapshot).
- Snapshot missing cron-jobs.json at all → rollback still succeeds,
                                              reports 'not captured'
                                              (older pre-feature snapshots
                                              keep working).

Writes go through cron.jobs.save_jobs under the same _jobs_file_lock the
scheduler uses, so rollback doesn't race tick().

Also:
- hermes_cli/curator.py: rollback confirm dialog now shows
  'cron jobs: N (will be restored for skill-link fields only)' when the
  snapshot has cron data, or 'not in snapshot (<reason>)' otherwise.
- rollback()'s message string includes a 'cron links: ...' clause
  summarizing the reconciliation outcome.

Tests
- 9 new cases: snapshot-with-cron, snapshot-without-cron, malformed-json
  captured-as-raw, full rollback-restores-skills-and-cron, rollback
  touches only skill fields, rollback skips user-deleted jobs, rollback
  leaves user-created jobs untouched, rollback still works with
  pre-feature snapshot that has no cron-jobs.json, standalone unit test
  on _restore_cron_skill_links exercising the full report shape.

Validation
- 484/484 targeted tests pass (curator + cron + skill_manager suites).
- E2E: real snapshot_skills, real cron rewrite, real rollback. Before:
  ['pr-review-format', 'pr-review-checklist', 'pr-triage-salvage'].
  After curator: ['hermes-agent-dev']. After rollback: ['pr-review-format',
  'pr-review-checklist', 'pr-triage-salvage']. Non-skill fields (id,
  name, prompt) preserved across the round trip.
2026-05-02 01:29:57 -07:00
Siddharth Balyan
f98b5d00a4
fix: gateway systemd unit now retries indefinitely with backoff (#18639)
The old defaults (StartLimitIntervalSec=600, StartLimitBurst=5,
RestartSec=30) meant any network outage over ~5 minutes would
permanently kill the gateway until manual intervention.

Changes:
- StartLimitIntervalSec=0 (never give up)
- Restart=always (not just on-failure)
- RestartSec=60 with RestartMaxDelaySec=300, RestartSteps=5
  (exponential backoff: 60 → 120 → 180 → 240 → 300s cap)
- After=network-online.target + Wants= (both units now wait for
  actual connectivity, not just network.target)

Power outage → internet down → internet back = auto-recovery.
2026-05-02 08:51:30 +05:30
Siddharth Balyan
585d6778da
fix: allow WebSocket connections from non-loopback IPs in --insecure mode (#18633)
When the dashboard is bound to 0.0.0.0 with --insecure (e.g. behind
Tailscale Serve), WebSocket endpoints (/api/pty, /api/ws, /api/pub,
/api/events) rejected connections from non-loopback client IPs with
code 4403 — causing 'events feed disconnected' in the UI.

Extract the repeated loopback check into _ws_client_is_allowed() which
respects the public bind flag. Session token auth still guards all
endpoints regardless of bind mode.
2026-05-02 08:17:45 +05:30
kshitijk4poor
f903ceece0 chore: add contributors to AUTHOR_MAP for Slack batch salvage
Some checks failed
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-and-push (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
Tests / test (push) Waiting to run
Tests / e2e (push) Waiting to run
Nix Lockfile Fix / auto-fix-main (push) Has been cancelled
Nix Lockfile Fix / fix (push) Has been cancelled
Adds email→username mappings for:
- priveperfumes (PR #18456)
- amroessam (PR #17798)
- Hinotoi-agent (PR #9361)
- valda (PR #14932)
2026-05-01 14:01:26 -07:00
Amr Essam
d05a87e686 fix(gateway): clear slack assistant thread status 2026-05-01 14:01:26 -07:00
hinotoi-agent
a147164d3c fix(slack): preserve per-user slash-command session isolation 2026-05-01 14:01:26 -07:00
nightq
5cdc39e29a fix(gateway): preserve case-sensitive chat IDs in DeliveryTarget.parse
Fixes NousResearch/hermes-agent#11768

Root cause: target.strip().lower() was lowercasing the entire target string,
corrupting case-sensitive chat IDs like Slack C123ABC and Matrix !RoomABC.

Fix: Only lowercase the platform prefix for case-insensitive matching;
preserve the original case for chat_id and thread_id values.
2026-05-01 14:01:26 -07:00
YAMAGUCHI Seiji
2b3923ff13 fix(gateway): coerce scalar free_response_channels to str before split
YAML loads a bare numeric value such as
    discord:
      free_response_channels: 1491973769726791812
as an int.  _discord_free_response_channels() / _slack_free_response_channels()
checked `isinstance(raw, list)` and `isinstance(raw, str)` in that order and
then fell through to `return set()`, so a single-channel config that happened
to be unquoted was silently dropped with no log line — the bot kept demanding
@mentions even though the channel was configured to free-response.

A multi-channel value like `1234567890,9876543210` does not trip this because
the comma forces YAML to parse it as a string.  Single-channel configs are
the only case that breaks, which is exactly the footgun that's hardest to
diagnose (the config "looks right" and the feature just doesn't activate).

Note that the old-schema env-var bridge at gateway/config.py:614+ already
runs `str(frc)` when forwarding to SLACK_/DISCORD_FREE_RESPONSE_CHANNELS,
so the env-var fallback worked.  The bug only surfaces on the
`config.extra["free_response_channels"]` path populated by the `platforms:`
bridge at gateway/config.py:576, which passes the raw YAML value through
unchanged.

Fix at the reader: treat any non-list value as a scalar, coerce with str(),
then apply the same CSV split semantics.  This keeps the public contract
stable (list or str-like continues to work identically) while accepting
the ints that the YAML loader is free to hand us.

Added tests for both Discord and Slack covering:
  - bare int value in config.extra
  - list of ints in config.extra
2026-05-01 14:01:26 -07:00
Prive FE Coder
a717199bbf fix(slack): exclude reserved Slack commands from native slash manifest
Slack has built-in slash commands (e.g. /status, /me, /join) that apps
cannot register. When running `hermes slack manifest --write`, the
generated manifest included /status, causing Slack to reject the entire
manifest with a reserved-command error.

Add _SLACK_RESERVED_COMMANDS frozenset of all known Slack built-ins and
skip them in slack_native_slashes(). Affected commands remain reachable
via /hermes <command>.

Tests updated:
- New test_excludes_slack_reserved_commands validates no leaks
- test_includes_canonical_commands no longer asserts /status
- test_telegram_parity accounts for expected Slack-only exclusions
2026-05-01 14:01:26 -07:00
kshitijk4poor
8fcc160f6b fix(gateway/slack): review fixes — scope ephemeral to commands, user isolation
Self-review fixes for the slash ephemeral ack:

- Only stash response_url when text starts with '/' (gateway command).
  Free-form questions via '/hermes <question>' must produce public agent
  replies visible to the whole channel, not ephemeral.
- Use a ContextVar (_slash_user_id) to thread the invoking user's ID
  from _handle_slash_command through to send().  _pop_slash_context now
  matches the exact (channel_id, user_id) key when the ContextVar is
  set, preventing concurrent users on the same channel from stealing
  each other's ephemeral context.  ContextVars propagate to child
  asyncio.Tasks, so the value survives through handle_message →
  _process_message_background → _send_with_retry → send().
- Add truncate_message() in _send_slash_ephemeral to prevent silent
  failures on long responses (response_url has the same ~40k limit).
- Log send_private_notice failures at debug level instead of bare
  except/pass — aids diagnostics without spamming.
- Document app_mention dedup dependency on shared event ts.
- Add tests: free-form question must NOT stash context, concurrent
  users on the same channel get isolated contexts, non-slash send()
  path fallback behavior.
2026-05-01 13:33:06 -07:00
kshitijk4poor
f34d298495 chore: add probepark to AUTHOR_MAP
Required for contributor_audit.py strict mode on the salvaged
PR #9340 commit.
2026-05-01 13:33:06 -07:00
probepark
0ab2d752ff feat(gateway): private notice delivery and Slack format_message fixes
Adds platform-level private notice delivery abstraction so operational
messages (e.g. sethome prompt) can be sent ephemerally on Slack when
configured with `slack.notice_delivery: private`.

Changes:
- gateway/config.py: _normalize_notice_delivery() + GatewayConfig.get_notice_delivery()
  with per-platform config bridging
- gateway/platforms/base.py: send_private_notice() default implementation
  (falls through to send())
- gateway/platforms/slack.py: send_private_notice() via chat_postEphemeral
- gateway/run.py: _deliver_platform_notice() helper replaces direct
  adapter.send() for the sethome notice, with private→public fallback
- gateway/platforms/slack.py: app_mention handler now forwards to
  _handle_slack_message (safe due to ts-based dedup) instead of no-op pass,
  fixing edge-case Slack configs where mentions arrive only as app_mention
- gateway/platforms/slack.py format_message: negative lookbehind prevents
  markdown images (![]()) from becoming broken Slack links; italic regex
  now requires non-whitespace boundaries so 'a * b * c' stays literal

Based on PR #9340 by @probepark.
2026-05-01 13:33:06 -07:00
kshitijk4poor
7cda0e5224 fix(gateway/slack): ephemeral ack and routing for slash commands
Slack slash commands (/q, /btw, /stop, /model, etc.) previously showed
no user-visible acknowledgement and posted command replies as public
channel messages.  This diverged from Discord, which uses ephemeral
deferred responses for slash commands.

Changes:
- handle_hermes_command now passes response_type='ephemeral' and a
  'Running /cmd…' text to ack(), giving the user immediate 'Only visible
  to you' feedback when they invoke any native slash command.
- _handle_slash_command stashes the Slack response_url from the command
  payload in a per-channel context dict before dispatching to
  handle_message.
- send() checks for a pending slash context and, when found, POSTs to
  the response_url with replace_original=true to swap the initial ack
  with the real command reply (e.g. 'Queued for the next turn.'),
  keeping it ephemeral.
- Stale slash contexts are garbage-collected on lookup (120s TTL).
- The response_url POST is non-fatal: if it fails, the user already saw
  the initial ack, and send() returns success=True.

Fixes #18182
2026-05-01 13:33:06 -07:00
Jeffrey Quesnelle
0b76d23d1a
makes the Persistent Goals docs accessible in the docs nav (and llms.txt) (#18481) 2026-05-01 10:29:22 -07:00
Teknium
f99676e315
fix(gateway): auto-restart when source files change out from under us (#17648) (#18409)
Long-running gateway processes that survive 'hermes update' keep
pre-update modules cached in sys.modules. When new tool files on
disk then try to 'from hermes_cli.config import cfg_get' (added in
PR #17304), the import resolves against the stale module object
and raises ImportError — hitting users on Matrix, Telegram, Feishu,
and other platforms.

Two defenses:

1. Gateway self-check (gateway/run.py). On __init__, snapshot the
   newest mtime across sentinel source files (hermes_cli/config.py,
   run_agent.py, gateway/run.py, etc.). On every inbound message,
   re-read those mtimes; if any is newer than boot time + 2s slack,
   request a graceful restart via the normal drain path and return
   a one-line ack to the user. Idempotent, works regardless of how
   the update happened (hermes update, manual git pull, installer).

2. Post-restart survivor sweep ('hermes update'). After the existing
   restart loop, sleep 3s, rescan for gateway PIDs we already tried
   to kill, and SIGKILL any survivors. The detached profile watchers
   and systemd then relaunch with fresh code instead of waiting out
   the 120s watcher timeout.

Closes #17648.
2026-05-01 09:50:08 -07:00
Teknium
77c0bc6b13
fix(curator): defer first run and add --dry-run preview (#18373) (#18389)
* fix(curator): defer first run and add --dry-run preview (#18373)

Curator was meant to run 7 days after install, not on the very first
gateway tick. On a fresh install (no .curator_state), should_run_now()
returned True immediately because last_run_at was None — so the gateway
cron ticker fired Curator against a fresh skill library moments after
'hermes update'. Combined with the binary 'agent-created' provenance
model (anything not bundled and not hub-installed), this consolidated
hand-authored user workflow skills without consent.

Changes:
- should_run_now(): first observation seeds last_run_at='now' and returns
  False. The next real pass fires one full interval_hours later (7 days
  by default), matching the original design intent.
- hermes curator run --dry-run: produces the same review report without
  applying automatic transitions OR permitting the LLM to call
  skill_manage / terminal mv. A DRY-RUN banner is prepended to the
  prompt and the caller skips apply_automatic_transitions. State is
  NOT advanced so a preview doesn't defer the next scheduled real pass.
- hermes update: prints a one-liner on fresh installs pointing at
  --dry-run, pause, and the docs. Silent on steady state.
- Docs: curator.md and cli-commands.md explain the deferred first-run
  behavior and warn that hand-written SKILL.md files share the
  'agent-created' bucket, with guidance to pin or preview before the
  first pass.

Tests:
- test_first_run_defers replaces the old 'first run always eligible'
  assertion — same fixture, inverted expectation.
- test_maybe_run_curator_defers_on_fresh_install covers the gateway tick
  path end-to-end.
- Three new dry-run tests cover state-advance suppression, prompt
  banner injection, and apply_automatic_transitions skipping.

Fixes #18373.

* feat(curator): pre-run backup + rollback (#18373)

Every real curator pass now snapshots ~/.hermes/skills/ into
~/.hermes/skills/.curator_backups/<utc-iso>/skills.tar.gz before calling
apply_automatic_transitions or the LLM review. If a run consolidates or
archives something the user didn't want touched, 'hermes curator
rollback' restores the tree in one command. Dry-run is skipped — no
mutation means no snapshot needed.

Changes:
- agent/curator_backup.py (new): tar.gz snapshot + safe rollback. The
  snapshot excludes .curator_backups/ (would recurse) and .hub/ (managed
  by the skills hub). Extract refuses absolute paths and .. components,
  and uses tarfile's filter='data' on Python 3.12+. Rollback takes a
  pre-rollback safety snapshot FIRST, stages the current tree into
  .rollback-staging-<ts>/ so the extract lands in an empty dir, and
  cleans the staging dir on success. A failed extract restores the
  staged contents.
- agent/curator.py: run_curator_review() calls curator_backup.
  snapshot_skills(reason='pre-curator-run') before apply_automatic_
  transitions. Best-effort — a failed snapshot logs at debug and the
  run continues (a transient disk issue shouldn't silently disable
  curator forever).
- hermes_cli/curator.py: new 'hermes curator backup' and 'hermes curator
  rollback' subcommands. rollback supports --list, --id <ts>, -y.
- hermes_cli/config.py: curator.backup.{enabled, keep} config block
  with sane defaults (enabled=true, keep=5).
- Docs: curator.md gets a 'Backups and rollback' section; cli-commands
  .md table gets the new rows.

Tests (new file tests/agent/test_curator_backup.py, 16 cases):
- snapshot creates tarball + manifest with correct counts
- snapshot excludes .curator_backups/ (recursion guard) and .hub/
- snapshot disabled via config returns None without creating anything
- snapshot uniquifies ids within the same second (-01 suffix)
- prune honors keep count, newest-first
- list_backups + _resolve_backup cover newest-default and unknown-id
- rollback restores a deleted skill with content intact
- rollback is itself undoable — safety snapshot shows up in list_backups
- rollback with no snapshots returns an error
- rollback refuses tarballs with absolute paths or .. components
- real curator runs take a 'pre-curator-run' snapshot; dry-runs do not

All curator tests: 210 passing locally.
2026-05-01 09:49:59 -07:00
Siddharth Balyan
c5b4c48165
fix: lazy session creation — defer DB row until first message (#18370)
Prevents ghost sessions from accumulating in state.db when the TUI/web
dashboard is opened and closed without sending a message.

Changes:
- run_agent.py: Add _ensure_db_session() gate method, called at
  run_conversation() entry. Remove eager create_session() from __init__.
  Handle compression rotation flag correctly.
- tui_gateway/server.py: Remove eager db.create_session() in
  _start_agent_build(). Add post-first-message pending_title re-apply.
- hermes_state.py: Extract _insert_session_row() shared helper (DRY).
  Add prune_empty_ghost_sessions() for one-time migration.
- cli.py: One-time ghost session prune on startup. Fix _pending_title
  to call _ensure_db_session() before set_session_title().
- hermes_cli/main.py: Guard TUI exit summary on message_count > 0.
- tests: Update test_860_dedup to call _ensure_db_session() before
  direct _flush_messages_to_session_db() calls.

Closes: ghost session clutter in hermes sessions list and web dashboard.
2026-05-01 18:39:12 +05:30
Austin Pickett
20132435c0
Merge pull request #18117 from NousResearch/austin/fix/model-selector
feat(tui): overhaul /model picker to match hermes model with inline auth
2026-05-01 05:30:05 -07:00
Austin Pickett
5ad030d19d
Merge pull request #18095 from NousResearch/austin/feat/plugins-page
feat(dashboard): Plugins page — manage, enable/disable, auth status
2026-05-01 05:29:24 -07:00
Austin Pickett
05c63259b5
Merge pull request #18358 from NousResearch/fix/kanban-buton
fix: kanban button
2026-05-01 04:49:06 -07:00
Austin Pickett
a01c1f7305 fix: kanban button 2026-05-01 07:33:54 -04:00
Siddharth Balyan
75e1339d4c
fix(telegram): send seed message after creating DM topics (#18334)
Telegram's client does not display empty forum topics in the chat's
topic list. After createForumTopic succeeds, send a short pin message
into the new topic so it becomes immediately visible to the user.

Only fires for newly created topics (no thread_id in config yet).
Failure to send the seed is non-fatal (debug-logged, topic still works).
2026-05-01 15:21:56 +05:30
Ben Barclay
0159f25fd0
Merge pull request #18281 from NousResearch/bb/fix-tui-docker-ink-v2
fix: prevent tui rebuilding assets
2026-05-01 18:43:40 +10:00
UgwujaGeorge
b7ad3f478f fix(yuanbao): enforce owner identity check on group slash commands
The bot-owner identity check inside OwnerCommandMiddleware was commented
out and replaced with a hardcoded `is_owner = True`, so any group member
could trigger allowlisted privileged commands (/approve, /deny, /stop,
/reset, /retry, /undo, /new, /background, /bg, /btw, /queue, /q) by
sending the slash command without @-mentioning the bot. The most severe
case is /approve: a non-owner could approve a dangerous tool call the
bot was waiting on the owner to confirm.

Re-enable the documented identity check (push.from_account ==
push.bot_owner_id) so only the configured owner can issue these
commands.
2026-04-30 23:57:55 -07:00
Teknium
a2a32688ca
docs(website): add User Stories and Use Cases collage page (#18282)
Adds a new top-of-sidebar docs page at /docs/user-stories that is a
masonry-style collage of 99 real user stories sourced from X/Twitter,
GitHub issues/PRs, Reddit, Hacker News, YouTube, blogs (Medium, Substack,
dev.to), podcasts, LinkedIn, GitHub Gists, and Product Hunt.

Every tile links to the original post/issue/video/gist where someone
described a specific use case: personal assistants, dev workflows,
trading bots, research briefs, family WhatsApp agents, Kubernetes
deployments, legal-domain self-hosted setups, and more.

- docs/user-stories.mdx: MDX entry mounting the collage component
- src/components/UserStoriesCollage: React component with category +
  source filters, CSS-columns masonry layout, per-category accent colors
- src/data/userStories.json: source-of-truth dataset (force-added; the
  root .gitignore's unanchored 'data/' rule would otherwise swallow it,
  same reason skills.json is explicitly listed in website/.gitignore)
- sidebars.ts: link added at the top of the docs sidebar
2026-04-30 23:56:59 -07:00
Ben
a49f4c617d fix: prevent tui rebuilding assets 2026-05-01 16:29:46 +10:00
web-dev0521
dfe512c58d fix(paths): route achievements plugin + profile-tui through HERMES_HOME
Four callsites hardcoded Path.home() / '.hermes' with no HERMES_HOME
check, breaking Docker deployments and profile isolation (hermes -p):

- plugins/hermes-achievements/dashboard/plugin_api.py:
  state_path(), snapshot_path(), checkpoint_path() bare-literal paths
- scripts/profile-tui.py:
  DEFAULT_STATE_DB and DEFAULT_LOG defaults ignored HERMES_HOME
- hermes_cli/slack_cli.py:
  except-Exception fallback for slack-manifest.json dump
- optional-skills/migration/openclaw-migration/scripts/openclaw_to_hermes.py:
  --target argparse default

Use get_hermes_home() (with an ImportError shim for the standalone
scripts) or 'os.environ.get("HERMES_HOME") or str(Path.home()/".hermes")'
where importing hermes_constants is impractical.

E2E-verified: with HERMES_HOME=/tmp/x all three achievements paths and
both profile-tui defaults route under /tmp/x.

Salvaged from #18068 (original scope was broader mechanical cleanup
claiming 23 callsites were buggy; most were already respecting
HERMES_HOME via os.environ.get(key, default) — only these 4 had no env
check at all). Credit: @web-dev0521.
2026-04-30 23:21:54 -07:00