Second migration of an existing built-in platform adapter after Discord
(PR #30591) — follows the same shape established by IRC / Teams / LINE /
Google Chat / SimpleX and the playbook in
`references/platform-plugin-migration.md`. Advances the umbrella refactor
in #3823.
Matches Discord's parity bar — adapter under `plugins/platforms/mattermost/`
with the standard `__init__.py` / `adapter.py` / `plugin.yaml` shell,
`register(ctx)` entry point, **no back-compat shim** at the old import
path, and full parity for all five hooks Discord uses plus the
`apply_yaml_config_fn` hook (mattermost is the second consumer of #25443
after Discord):
* `standalone_sender_fn` — out-of-process cron delivery via Mattermost
REST API. Picks up the thread_id + media_files capabilities the
legacy `_send_mattermost` lacked (parity with Discord's `_standalone_send`).
* `setup_fn` — interactive `hermes setup gateway` wizard.
* `apply_yaml_config_fn` — translates `config.yaml` `mattermost:` keys
(`require_mention`, `free_response_channels`, `allowed_channels`) into
`MATTERMOST_*` env vars (replaces the hardcoded block in
`gateway/config.py`).
* `is_connected` — declares connection state from `MATTERMOST_TOKEN` +
`MATTERMOST_URL`.
* `check_fn` — verifies aiohttp is installed and both required env vars
are set.
* plus `allowed_users_env`, `allow_all_env`, `cron_deliver_env_var`,
`max_message_length` (4000 — Mattermost practical limit), `emoji`,
`required_env`, `install_hint`.
Files
-----
* `gateway/platforms/mattermost.py` (873 LOC) →
`plugins/platforms/mattermost/adapter.py` (git rename, R071) +
appended `register()` block, hook helpers, and `_standalone_send`
with media upload + thread_id support.
* New `plugins/platforms/mattermost/{__init__.py, plugin.yaml}` with
`requires_env` / `optional_env` declarations covering MATTERMOST_URL,
MATTERMOST_TOKEN, MATTERMOST_ALLOWED_USERS, MATTERMOST_ALLOW_ALL_USERS,
MATTERMOST_HOME_CHANNEL, MATTERMOST_REPLY_MODE,
MATTERMOST_REQUIRE_MENTION, MATTERMOST_FREE_RESPONSE_CHANNELS,
MATTERMOST_ALLOWED_CHANNELS.
* `gateway/config.py`: delete 17-LOC `mattermost_cfg` YAML→env bridge
(moved into plugin's `_apply_yaml_config`).
* `gateway/run.py::_create_adapter`: delete `Platform.MATTERMOST elif` —
replaced by the existing generic plugin-registry-first dispatch.
* `tools/send_message_tool.py`: delete `_send_mattermost` (22 LOC) +
`Platform.MATTERMOST elif` in `_send_to_platform` — the `else` branch
already routes plugin platforms through `_send_via_adapter`, which
hits the registry's `standalone_sender_fn`.
* `hermes_cli/setup.py`: delete `_setup_mattermost` (44 LOC) — replaced
by the plugin's `interactive_setup`.
* `hermes_cli/gateway.py`: delete `_PLATFORMS["mattermost"]` dict entry
(3 LOC) — plugin's `setup_fn` is dispatched via the plugin path in
`_configure_platform`.
* Consumer rewrite: 5 test files (test_mattermost.py,
test_media_download_retry.py, test_send_multiple_images.py,
test_stream_consumer.py, test_ws_auth_retry.py) get
`gateway.platforms.mattermost` → `plugins.platforms.mattermost.adapter`
with the bulk-rewrite recipe from the platform-plugin-migration playbook.
Single `mock.patch` string in test_stream_consumer.py also repointed.
* `tests/tools/test_send_message_missing_platforms.py`: thin
`(token, extra, chat_id, message)` compat shim around the plugin's
`_standalone_send(pconfig, …)` so existing test bodies continue to
work without rewriting every signature.
Validation
----------
* Plugin discovery: mattermost registers from `plugins/platforms/mattermost/`
alongside discord / teams / irc / line / google_chat / simplex.
All 9 hooks present (setup_fn, standalone_sender_fn,
apply_yaml_config_fn, is_connected, check_fn, allowed_users_env,
allow_all_env, cron_deliver_env_var, max_message_length=4000).
* Mattermost-touching tests: 62/62 pass
(`test_mattermost.py` + `test_send_message_missing_platforms.py`).
* Targeted selectors (mattermost or platform_registry or stream_consumer
or ws_auth_retry or media_download_retry or send_multiple_images or
send_message_tool or platform_connected): 433/433 pass.
* Full sweep (`scripts/run_tests.sh tests/gateway/ tests/cron/
tests/tools/test_send_message_tool.py tests/tools/test_send_message_missing_platforms.py
tests/integration/`): **6220/6220 pass in 47.8s, 0 failures**.
* Lint: ruff clean on all touched files.
* Git identity verified: kshitijk4poor.
* Rename detection: R071 (similarity dropped from a hypothetical R09x
by the ~320-line appended register block — ~36% growth over the
873-LoC base, vs Discord's 5101 LoC base which kept R091).
Closes part of #3823.
PR #30136 review caught: `S6ServiceManager.start/stop/restart` called
`subprocess.run(check=True)` on `s6-svc`, so any failure surfaced as
a raw `CalledProcessError` traceback. The two cases operators
actually hit are:
1. The service slot doesn't exist — most commonly because the user
typed a profile name wrong (`hermes -p typo gateway start`).
2. s6-svc itself fails — most commonly EACCES on the supervise
control FIFO when running unprivileged.
Both deserve named errors with actionable messages, not stacktraces.
Changes:
* Add `S6Error` base + two concrete errors in `hermes_cli.service_manager`:
- `GatewayNotRegisteredError(profile)` — carries the unprefixed
profile name; message: `no such gateway 'typo': register it
with `hermes profile create typo` first, or pass an existing
profile name via `-p <name>``.
- `S6CommandError(service, action, returncode, stderr)` — carries
the s6-svc rc and stderr; message: `s6-svc start on
'gateway-coder' failed (rc=111): <stderr>`.
* Factor lifecycle dispatch through `_run_svc(flag, label, name)`:
pre-checks that the service directory exists (raises
GatewayNotRegisteredError before invoking s6-svc), then runs
s6-svc and translates any CalledProcessError into S6CommandError.
* `_dispatch_via_service_manager_if_s6` in `hermes_cli.gateway`
catches both errors and prints `✗ <message>` + `sys.exit(1)`
instead of letting the exception bubble. The dispatch path that
used to dump a traceback at the user now gives an actionable
one-liner.
Tests: 6 new tests for the error types and their CLI rendering;
existing lifecycle test pre-seeds the slot directory before calling
`mgr.start` etc.
PR #30136 review caught that `hermes gateway stop --all` and
`... restart --all` were broken under s6. The Phase 4 dispatcher was
gated on `not stop_all` (and the symmetric restart_all), so `--all`
fell through to `kill_gateway_processes(all_profiles=True)`. pkill
SIGTERMed every gateway, s6-supervise observed the crashes, and
restarted every gateway ~1s later — net effect: `--all` *kicked*
gateways instead of *stopping* them.
Add `_dispatch_all_via_service_manager_if_s6(action)` that iterates
`mgr.list_profile_gateways()` and routes stop/restart through each
service slot. s6's `want up`/`want down` flips correctly, so a
stop persists. Partial failures are surfaced per-profile with a
running success count; the host pkill path is only reached when s6
isn't in play.
`start --all` isn't a CLI surface — the helper rejects it and
returns False (host code path can take over).
Phase 5 of the s6-overlay supervision plan. Documentation + small
diagnostic cleanups; no behavior changes.
website/docs/user-guide/docker.md:
- Replace the old 'entrypoint script does the bootstrap' section
with the s6-overlay boot flow (cont-init.d/01-hermes-setup,
cont-init.d/02-reconcile-profiles, static main-hermes + dashboard
services, ENTRYPOINT-as-main-program pattern).
- Add a 'Per-profile gateway supervision' subsection covering the
new lifecycle commands, restart semantics, log persistence, and
'Manager: s6 (container supervisor)' status reporting.
- Add 'Breaking change vs. pre-s6 images' callout naming the
/init ENTRYPOINT and pointing affected wrappers at the pin
workaround.
website/docs/user-guide/profiles.md:
- Add a note under 'Persistent services' pointing container users
at the docker.md section explaining s6 supervision inside the
image. Host-side systemd/launchd documentation is unchanged.
skills/software-development/hermes-s6-container-supervision/SKILL.md:
- New maintainer skill covering the supervision-tree map, file
layout, the Architecture B rationale (cont-init.d args + halt
exit-code propagation), quick recipes, and the 8 pitfalls we hit
while implementing the plan (PATH-without-/command, root-owned
profile dirs, SOUL.md as marker, the '143' anti-pattern, etc.).
hermes_cli/doctor.py:
- _check_gateway_service_linger skips on s6 (the linger concept
doesn't apply inside the container).
- New _check_s6_supervision section reports main-hermes/dashboard
state and per-profile-gateway count (registered vs supervised
up), only inside the s6 container. Host doctor output unchanged.
- External Tools / Docker check no longer emits a 'docker not
found' warning inside the container; prints an explanatory
info line instead. Still respects an explicit TERMINAL_ENV=docker
(in case the user mounted /var/run/docker.sock).
hermes_cli/gateway.py:
- Document _container_systemd_operational more precisely: it's
NOT for our Hermes Docker image (s6-overlay handles that via
detect_service_manager() == 's6'). It still covers
systemd-nspawn / k8s-with-systemd-init cases, so leaving it in
place is correct; the docstring just makes that explicit.
Test harness (verification, no test changes in this commit):
19 passed, 0 xfailed. 66 service-manager / container-boot /
profiles-s6-hooks / gateway-s6-dispatch unit tests still green.
61 doctor tests still green. Hadolint + shellcheck clean.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Phase 4 of the s6-overlay supervision plan. Activates the Phase 3
S6ServiceManager by hooking it into the profile lifecycle and the
`hermes gateway start/stop/restart` dispatcher, and adds a cont-
init.d-time reconciliation pass that survives `docker restart`.
Task 4.0 — container-boot reconciliation:
/run/service/ is tmpfs, so every `docker restart` wipes every
per-profile gateway slot. /etc/cont-init.d/02-reconcile-profiles
invokes hermes_cli.container_boot.reconcile_profile_gateways() on
every boot, which walks $HERMES_HOME/profiles/<name>/, reads each
gateway_state.json, recreates the s6 service slot, and auto-starts
only those whose last state was 'running'. Other states
(stopped, starting, startup_failed, missing) register the slot
in the down state — avoiding crash-loops across restarts for a
gateway that was broken last boot. Per-profile outcome is recorded
to $HERMES_HOME/logs/container-boot.log.
Implementation: hermes_cli/container_boot.py + 12 unit tests.
Profile-marker is SOUL.md, not config.yaml, because `hermes profile
create` only seeds SOUL.md by default (config.yaml comes from
`hermes setup`).
Task 4.1 / 4.2 — profile create/delete hooks:
hermes_cli/profiles.py::create_profile now calls
_maybe_register_gateway_service(<canon>) at the end, which routes
through ServiceManager.register_profile_gateway when running on s6
and no-ops on host backends. delete_profile mirrors with
_maybe_unregister_gateway_service. _allocate_gateway_port produces
a deterministic SHA-256-derived port in [9200, 9800).
Task 4.3 — gateway dispatch + remove rejection arms:
_dispatch_via_service_manager_if_s6(action) intercepts
start/stop/restart at the top of each subcommand and routes them
through S6ServiceManager.{start,stop,restart}. The pre-Phase-4
`elif is_container():` rejection arms are kept as fallback for
pre-s6 containers / unsupported runtimes, but only ever fire when
detect_service_manager() != 's6'. install/uninstall under s6
print informational guidance pointing users at profile create/delete.
Removed the two xfail(strict=True) markers from
tests/docker/test_profile_gateway.py — both tests now pass strictly.
Task 4.4 — status reporting:
get_gateway_runtime_snapshot() reports
Manager: 's6 (container supervisor)' inside an s6 container instead
of 'docker (foreground)'.
Plan-vs-reality drift fixed in this commit:
- Plan's S6ServiceManager._render_run_script used
`gateway start --foreground --port {port}` — invented args; the
real CLI is `gateway run`. Switched accordingly. port arg
retained for API parity but now documented as 'currently ignored'.
- Plan's reconciler keyed on config.yaml; switched to SOUL.md
(config.yaml is created by hermes setup, not by hermes profile
create, so the original gate caught nothing).
- The plan's _dispatch helper used _profile_arg() which returns
'--profile <name>' (i.e. with the flag prefix). Switched to
_profile_suffix() which returns the bare name.
- Architecture B's docker exec doesn't get /command on PATH or
the venv on PATH; Dockerfile's runtime PATH now includes
/opt/hermes/.venv/bin so 'docker exec <c> hermes ...' works
without sourcing the venv.
- stage2-hook now chowns $HERMES_HOME/profiles to hermes on every
boot, not just on the UID-remap path. Without this, files created
by docker-exec-as-root accumulate and the next reconciler run
fails with PermissionError reading SOUL.md.
Test harness:
19 passed, 0 xfailed (the two pre-Phase-4 xfail targets flip to
passing). 78 unit tests across service_manager + container_boot +
profiles_s6_hooks + gateway_s6_dispatch. Hadolint + shellcheck
pass cleanly.
Refs: docs/plans/2026-05-07-s6-overlay-dynamic-subagent-gateways.md
Two bugs surfaced by PR #24356 migrating Discord into the registry:
1. plugins/platforms/discord/adapter.py::_is_connected — read DISCORD_BOT_TOKEN
via hermes_cli.gateway.get_env_value (the abstraction tests patch) instead
of os.getenv directly. The legacy non-registry path used get_env_value;
bypassing it broke test_setup_openclaw_migration which patches
gateway_mod.get_env_value to simulate a hermetic env.
2. hermes_cli/gateway.py::_platform_status — when entry.is_connected is
defined and returns False, return 'not configured' immediately. Don't
fall back to entry.check_fn(), which would let 'SDK is installed'
override 'no token configured' and incorrectly report the platform as
ready. The fallback to check_fn is the right behaviour only when
is_connected is None (not registered).
Fixes 5 test failures observed on CI for PR #24356:
- tests/hermes_cli/test_setup.py::test_setup_gateway_skips_service_install_when_systemctl_missing
- tests/hermes_cli/test_setup.py::test_setup_gateway_in_container_shows_docker_guidance
- tests/hermes_cli/test_setup_irc.py::TestIRCGatewaySetupFreshInstall::test_setup_gateway_irc_counts_as_messaging_platform
- tests/hermes_cli/test_setup_openclaw_migration.py::TestGetSectionConfigSummary::test_gateway_returns_none_without_tokens
- tests/hermes_cli/test_setup_openclaw_migration.py::TestSetupWizardSkipsConfiguredSections::test_sections_skipped_when_migration_imported_settings
Same _platform_status bug exists for sibling plugin platforms (teams,
google_chat) whose check_fn returns true on SDK install alone; their
tests just never exercised the registry path before. The bug only became
test-visible when Discord migrated into the registry.
Validation: 11,167 tests across tests/gateway/ + tests/cron/ +
tests/tools/test_send_message_tool.py + tests/hermes_cli/ pass with zero
failures.
First migration of an existing built-in platform adapter to the plugin
system established by IRC / Teams / LINE / Google Chat. Closes#24325;
advances the umbrella refactor in #3823.
Matches Teams' shape exactly — adapter under ``plugins/platforms/discord/``
with the standard ``__init__.py`` / ``adapter.py`` / ``plugin.yaml``
shell, ``register(ctx)`` entry point, **no back-compat shim** at the old
import path, and full parity for the four hooks Teams uses plus the
``apply_yaml_config_fn`` hook that landed in #25443 (the Discord plugin
is the first consumer of that hook):
* ``standalone_sender_fn`` — out-of-process cron delivery via REST API
* ``setup_fn`` — interactive ``hermes setup gateway`` wizard
* ``apply_yaml_config_fn`` — translate ``config.yaml`` ``discord:`` keys
into ``DISCORD_*`` env vars (replaces the hardcoded block in
``gateway/config.py``)
* ``is_connected`` — declares connection state from ``DISCORD_BOT_TOKEN``
* ``check_fn`` — lazy-installs ``discord.py`` on demand
* plus ``allowed_users_env``, ``allow_all_env``, ``cron_deliver_env_var``,
``max_message_length``, ``emoji``, ``required_env``, ``install_hint``
* ``gateway/platforms/discord.py`` (5,101 LOC) →
``plugins/platforms/discord/adapter.py`` (git rename, R090).
* New ``plugins/platforms/discord/{__init__.py, plugin.yaml}`` with
``requires_env`` / ``optional_env`` declarations.
* Append ``register(ctx)`` block + new hook implementations
(``_standalone_send``, ``interactive_setup``, ``_apply_yaml_config``,
``_clean_discord_user_ids``, ``_is_connected``, ``_build_adapter``,
plus helpers ``_DISCORD_CHANNEL_TYPE_PROBE_CACHE`` etc.) to the
adapter.
* Replace the ``Platform.DISCORD elif`` branch in
``GatewayRunner._create_adapter()`` (−9 LOC) with a generic post-creation
hook (+6 LOC) in the registry path: any plugin adapter that declares a
``gateway_runner`` attribute now gets it auto-injected. Webhook's
built-in branch is unchanged (it doesn't go through the registry path).
* Move ``_send_discord`` (190 LOC) and helpers
(``_DISCORD_CHANNEL_TYPE_PROBE_CACHE``, ``_remember_channel_is_forum``,
``_probe_is_forum_cached``, ``_derive_forum_thread_name``) from
``tools/send_message_tool.py`` into the plugin as ``_standalone_send``.
* Wire via ``standalone_sender_fn=_standalone_send`` (Teams pattern; same
gap fixed in #21804 for other plugin platforms).
* Replace the Discord ``elif`` in ``tools/send_message_tool.py``
``_send_to_platform`` with a 10-line registry-hook dispatch.
* Drop the ``DiscordAdapter`` import and the
``Platform.DISCORD: DiscordAdapter.MAX_MESSAGE_LENGTH`` ``_MAX_LENGTHS``
entry — the registry's ``max_message_length=2000`` covers it.
* Move ``_setup_discord`` and ``_clean_discord_user_ids`` (68 LOC) from
``hermes_cli/setup.py`` into the plugin as ``interactive_setup``.
* Wire via ``setup_fn=interactive_setup``. CLI helpers (``prompt``,
``print_info``, etc.) are lazy-imported so the plugin's module-load
surface stays minimal.
* Remove ``"discord": _s._setup_discord`` from
``hermes_cli/gateway.py::_builtin_setup_fn``.
* Remove the entire 32-line ``_PLATFORMS["discord"]`` static dict entry —
Discord's setup metadata is now discovered dynamically via
``_all_platforms()`` from the registry entry.
* Move the 59-line ``discord_cfg`` YAML→env bridge from
``gateway/config.py::load_gateway_config()`` into the plugin as
``_apply_yaml_config``. Covers ``require_mention``,
``thread_require_mention``, ``free_response_channels``, ``auto_thread``,
``reactions``, ``ignored_channels``, ``allowed_channels``,
``no_thread_channels``, ``allow_mentions.{everyone,roles,users,
replied_user}``, and ``reply_to_mode`` (including the YAML 1.1
``off``-as-False coercion and the ``extra.reply_to_mode`` fallback).
* Wire via ``apply_yaml_config_fn=_apply_yaml_config``.
* The hook runs BEFORE ``_apply_env_overrides`` and after the generic
shared-key loop, exactly as documented in
``website/docs/developer-guide/adding-platform-adapters.md``.
* Behavior is preserved exactly — every assignment still uses
``not os.getenv(...)`` guards so env vars take precedence over YAML.
All 78 references to the old import path are rewritten — no back-compat
shim:
* 51 ``from gateway.platforms.discord import X`` →
``from plugins.platforms.discord.adapter import X``
* 5 ``import gateway.platforms.discord as discord_platform`` →
``import plugins.platforms.discord.adapter as discord_platform``
* 1 ``from gateway.platforms import discord as discord_mod`` →
``from plugins.platforms.discord import adapter as discord_mod``
* 21 ``mock.patch("gateway.platforms.discord.X")`` strings →
``mock.patch("plugins.platforms.discord.adapter.X")``
* 1 docstring reference in ``hermes_cli/commands.py``
* 1 import in ``tools/send_message_tool.py`` (now removed entirely)
The import-safety test in ``tests/gateway/test_discord_imports.py`` is
updated to purge the new canonical module name from ``sys.modules``.
**38 files changed, +621 / −473** — net positive due to the YAML hook
implementation (89 new LOC in the plugin trading for 59 deleted in core),
but every line moved has a clear plugin home now. The git rename is
detected at R090 because the adapter gained ~340 LOC of moved-in hook
implementations (``_standalone_send`` + ``interactive_setup`` +
``_apply_yaml_config`` + helpers).
* All 568 Discord-specific tests pass across 25 ``test_discord_*.py``
files plus voice/send/text-batching/reload-skills/stream-consumer/
integration tests.
* All 147 tests in the YAML-touching subset
(``test_discord_reply_mode``, ``test_discord_free_response``,
``test_discord_allowed_channels``, ``test_discord_allowed_mentions``,
``test_discord_channel_controls``, ``test_discord_reactions``,
``test_discord_thread_persistence``, ``test_runtime_footer``) pass —
this is the strongest signal that the YAML→env hook behaves
identically to the legacy block.
* Broader gateway/cron/integration sweep (1297 tests) introduces zero
new failures vs ``main``. Pre-existing failures in
``tests/gateway/test_tts_media_routing.py`` and
``tests/e2e/test_platform_commands.py`` reproduce identically on the
unchanged ``main`` revision.
* Plugin discovery sanity check confirms Discord registers alongside the
other four platform plugins:
Registered platforms: ['discord', 'google_chat', 'irc', 'line', 'teams']
These Discord-shaped tendrils in core were **deliberately not moved** —
they are generic platform-registry concerns affecting every platform,
not Discord-specific:
* ``gateway/config.py:1205`` ``DISCORD_BOT_TOKEN → config.token`` env
enablement — same shape Telegram has. The existing
``env_enablement_fn`` registry hook only seeds ``extra``, not
``.token``, so it can't replace this without an adapter refactor to
read from ``extra["bot_token"]``.
* ``gateway/run.py`` voice-mode hooks
(``self.adapters.get(Platform.DISCORD)`` for
``start_voice_mode``/``stop_voice_mode``), role-based auth,
``DISCORD_ALLOW_BOTS`` branch in ``_is_user_authorized``,
``_UPDATE_ALLOWED_PLATFORMS`` frozenset, and the per-platform
allowlist maps — generic platform-registry concerns.
* ``Platform.DISCORD`` enum literal — stable identifier used as dict
keys throughout the codebase; removing it is a separate refactor with
no real benefit.
* ``tools/discord_tool.py`` and ``tools/environments/local.py`` —
first-class agent tools and env-passthrough config, neither is the
gateway adapter.
Each of these is worth its own scoping issue when the time comes.
Preserve Windows profile install decisions across UAC handoff, avoid visible console windows by launching via pythonw, make repeated install/start idempotent, recreate stale Scheduled Tasks, and separate start-now from login auto-start behavior. Add Windows gateway regression coverage and systemd setup tests for the shared install flow.
hermes_cli/gateway.py:3702 referenced logger.debug() but 'logger' was
never defined in the module, causing a NameError at runtime if the
try/except around discover_plugins() caught an exception.
Added import logging and logger = logging.getLogger(__name__)
at module level to resolve the undefined name.
Extract PATH building into _build_service_path_dirs() that skips directories
which don't exist on disk (e.g. node_modules/.bin for pip installs) and also
includes ~/.hermes/node/bin and ~/.hermes/node_modules/.bin for agent-browser.
* fix(install): use `--extra all` not `--all-extras`; drop lazy-covered extras from [all]
Two coupled fixes for the Windows install hang where uv sync built
python-olm from sdist and failed on missing make.
# Root cause: --all-extras vs --extra all (credit: ethernet)
`uv sync --all-extras` installs every key in [project.optional-
dependencies], bypassing the curated [all] extra entirely. So even
when [all] excluded [matrix], [rl], [yc-bench], etc., the installer
pulled them anyway because they were still defined as extras. On
Windows that meant python-olm (no wheel, needs make to build from
sdist) and the install died there.
The right flag is `--extra all` — install just the [all] extra's
contents, respecting curation. Empirically verified via dry-run:
--all-extras: pulls python-olm, mautrix, ctranslate2, onnxruntime,
atroposlib, tinker, wandb, modal, daytona, vercel,
python-telegram-bot, discord.py, slack-bolt,
dingtalk-stream, lark-oapi, anthropic, boto3,
edge-tts, elevenlabs, exa-py, fal-client, faster-
whisper, firecrawl-py, honcho-ai, parallel-web
--extra all: pulls none of those — just [all]'s curated set
Dockerfile already uses `--extra all` (with comment explaining the
gotcha) — knowledge existed; the gap was install.sh / install.ps1 /
setup-hermes.sh.
Sites fixed: scripts/install.sh L1118, scripts/install.ps1 L809,
setup-hermes.sh L245.
# Companion fix: drop lazy-covered extras from [all]
`tools/lazy_deps.py` already covers anthropic, bedrock, exa,
firecrawl, parallel-web, fal, edge-tts, elevenlabs, modal, daytona,
vercel, all messaging platforms (telegram/discord/slack/matrix/
dingtalk/feishu), honcho, and faster-whisper. They were ALSO in
[all], which defeats the whole point of lazy-install — fresh
installs eager-pulled them and inherited whatever was broken
upstream (the matrix → python-olm → no Windows wheel chain being
the proximate symptom).
[all] now contains only what genuinely can't be lazy-installed:
cron, cli, dev, pty, mcp, homeassistant, sms, acp, google, web,
youtube. Same trim applied to [termux-all]. New regression test
asserts the contract: every extra in LAZY_DEPS must NOT also appear
in [all].
# Companion fix: surface uv progress + errors
setup-hermes.sh's hash-verified path swallowed uv's stderr to a
tempfile, identical to the install.sh bug fixed in PR #24504. Same
fix applied: stream stderr through directly so users see live
progress instead of staring at a frozen prompt.
# Files
- pyproject.toml: trim [all] and [termux-all] to non-lazy extras only.
- scripts/install.sh: --all-extras → --extra all; trim _ALL_EXTRAS /
_PYPI_EXTRAS to match.
- scripts/install.ps1: --all-extras → --extra all; trim $allExtras /
$pypiExtras to match.
- setup-hermes.sh: --all-extras → --extra all; stream stderr.
- tests/test_project_metadata.py: invert matrix-in-[all] assertion;
add lazy-coverage contract test.
- uv.lock: regenerated.
# Validation
5/5 metadata tests pass. 37/37 in update_autostash + tool_token_
estimation. `uv lock --check` passes. Empirical dry-run confirms
`--extra all` excludes python-olm + RL chain on the new lockfile.
* fix(install): parse [all] from pyproject.toml instead of mirroring it
ethernet's review point: the previous patch left two hand-mirrored
copies of [all]'s contents (in install.sh's $_ALL_EXTRAS and
install.ps1's $allExtras). That guarantees future drift the next
time pyproject.toml's [all] changes.
Now both scripts parse pyproject.toml at install time using stdlib
tomllib (Python 3.11+, which the bootstrap step already requires).
Single source of truth. The only purpose of the parsed list is to
build the 'Tier 2: [all] minus broken extras' fallback spec — so we
parse, filter against $brokenExtras, and rebuild the .[a,b,c] spec.
Also: removed redundant fallback tiers.
Before: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 PyPI-only extras (no git deps)
Tier 4 [web,mcp,cron,cli,messaging,dev]
Tier 5 .
After: Tier 1 [all]
Tier 2 [all] minus broken
Tier 3 .
Tier 3 (PyPI-only) and Tier 4 (dashboard+core) used to dodge the [rl]
git+sdist deps and the [matrix] python-olm build. Both are no longer
in [all] post-2026-05-12 lazy-install migration, so the carve-out
tiers had no remaining content. Tier 4 also referenced [messaging],
which is now lazy-installed — the hardcoded fallback was actually
inconsistent with the new policy.
Defensive fallback: if tomllib parse fails (corrupted pyproject,
unexpected schema), Tier 2 collapses to '.[all]' (same as Tier 1) so
the broken-extras path becomes a no-op rather than crashing.
* fix(gateway): hide Matrix from setup picker on Windows
Matrix is the one messaging platform that has no working install path
on Windows: [matrix] -> mautrix[encryption] -> python-olm, which has
Linux-only wheels and needs make + libolm to build from sdist. The
[all] cleanup in this PR keeps mautrix out of fresh installs, but a
user who picked Matrix in 'hermes setup gateway' would still walk
into the same sdist build failure when the wizard tried to install
the extra.
Hide the option at the picker so users never get the chance to try.
The gate lives in _all_platforms() — single source of truth for the
setup wizard, the curses gateway-config menu, and any future picker.
Adapter loading at runtime is intentionally NOT gated: users who
already have MATRIX_* env vars set (e.g. config copied from a Linux
install) keep working if they somehow have python-olm available.
This is the lowest-friction fix — picker visibility only.
Tests cover linux/darwin/win32 and verify other platforms aren't
collateral damage.
Replace with for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.
608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
Salvage of NousResearch/hermes-agent#7622.
Docker images often lack procps so `ps` is unavailable. Try reading
/proc/*/cmdline first (works in any Linux container) and fall back to
`ps -A eww` only when /proc is not present. PermissionError on
individual PIDs is silently skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
run_gateway() calls refresh_systemd_unit_if_needed() on every invocation
so restart settings stay current after exit-code-75 respawns. The
user-scope unit path resolves under Path.home() (NOT sandboxed by
conftest, only HERMES_HOME is), and generate_systemd_unit() bakes the
current HERMES_HOME into the unit's Environment= line.
Result: any test that exercises run_gateway() end-to-end on a real
Linux dev box silently rewrites the developer's installed
~/.config/systemd/user/hermes-gateway.service with a polluted
HERMES_HOME pointing at /tmp/pytest-of-<user>/.../hermes_test. On the
next reboot, systemd loads that unit, the gateway starts looking at an
empty tmp dir, and Telegram/Discord/etc. all show as 'No messaging
platforms enabled' even though the user's real config is fine. Three
tests in tests/hermes_cli/test_gateway.py hit this path:
test_run_gateway_exits_cleanly_on_keyboard_interrupt,
test_run_gateway_exits_nonzero_when_start_gateway_reports_failure, and
test_run_gateway_root_guard_has_escape_hatch.
Two-layer fix:
1. _install_fake_gateway_run helper (covers all four run_gateway() call
sites in test_gateway.py and any future ones) now also stubs
supports_systemd_services and refresh_systemd_unit_if_needed.
2. refresh_systemd_unit_if_needed() itself sniffs the generated unit
body for /pytest-of- and /hermes_test markers and refuses to write
when present. Defense in depth so a future test that bypasses the
helper still can't corrupt the dev's gateway. Tests that legitimately
exercise the refresh flow (test_run_gateway_refreshes_outdated_unit_on_boot)
patch generate_systemd_unit to return synthetic content that doesn't
carry those markers, so they keep working.
Adds test_refresh_refuses_to_bake_pytest_tmpdir_into_real_user_unit as a
regression test for the source-side guard.
When systemd_restart / systemd_status / systemd_stop run under sudo,
HERMES_HOME is stripped and HOME=/root, so get_hermes_home() resolves
to /root/.hermes instead of the unit's pinned home. read_runtime_status
and get_running_pid then look at the wrong gateway_state.json — the
60s status poll never sees "running", times out, and forces another
systemctl restart that SIGTERMs the in-progress new gateway.
Read the unit's pinned HERMES_HOME from `systemctl show -p Environment`
and mirror it into os.environ before any HERMES_HOME-derived read.
Early-out when system=False (user-scope inherits naturally). Errors
swallowed so a transient systemctl failure doesn't break unrelated
CLI ops.
Closes#22035.
Extends #19994 to the restart path. Dashboard spawns 'hermes gateway
restart' in the background; when a wedged adapter websocket pushes
drain past the 90s CLI timeout, the dashboard previously surfaced a
raw subprocess.TimeoutExpired traceback.
Mirror systemd_stop()'s TimeoutExpired catch onto both forcing-restart
sites in systemd_restart(). Adds a test that exercises the no-active-pid
branch end-to-end.
Both setup wizards (hermes setup and hermes gateway setup) gated the
service install/start/restart prompts behind 'supports_systemd or
is_macos()' and fell through to 'run in foreground' on Windows, even
though _is_service_installed() / _is_service_running() already call
gateway_windows.is_installed() and the Windows backend has a full
install/start/stop/restart contract.
Wire the Windows branch into both wizards:
- supports_service_manager now includes is_windows().
- Install offer reads 'Scheduled Task service' on Windows.
- install() on Windows starts the task inline via schtasks /Run (or
direct-spawn fallback) so the separate 'Start the service now?'
prompt is skipped.
- Start and Restart delegate to gateway_windows.start() / .restart().
hermes_cli/setup.py +30 -4
hermes_cli/gateway.py +28 -4
## Two residual Windows fixes that were hanging from earlier commits.
### 1. `hermes gateway status` reported 2 PIDs per gateway — TWO bugs compounded
Diagnosed with psutil parent/child walk against live gateway PIDs:
**Bug A (the real one): `_get_parent_pid` silently failed on Windows.**
The helper shelled out to `ps -o ppid= -p <pid>`, which doesn't exist
on Windows — `FileNotFoundError` → returns `None` → the ancestor walk
terminated at `os.getpid()` alone. Consequence: the PID table scan in
`_scan_gateway_pids` couldn't filter out `hermes gateway status`'s own
launcher stub (a venv `pythonw.exe`/`python.exe` that matches the same
`-m hermes_cli.main gateway` pattern as the gateway). Every status
call saw "itself" as a second gateway.
Fix: `_get_parent_pid` now calls `psutil.Process(pid).ppid()` first
(psutil is a core dependency since 3dfb35700) and falls back to `ps`
only when `shutil.which("ps")` succeeds — matching the Windows-footgun
checker's "always guard `ps` / `wmic` / etc. with `shutil.which`" rule.
Before: `Gateway process running (PID: 21952, 46880)` — 46880 changing
on every call (the status invocation's own launcher, which died by the
time the next status call looked).
After (5 consecutive calls):
```
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
✓ Gateway process running (PID: 21952)
```
Ancestor walk on the fix: 14 PIDs (full chain through bash/explorer)
instead of the broken 1-PID set.
**Bug B (the cosmetic one): venv-launcher dedup.** Standard Windows
CPython venv behaviour is that `<venv>/Scripts/pythonw.exe` is a ~5 MB
launcher stub that spawns the base Python (`C:\\Program Files\\Python311
\\pythonw.exe`) with the same command line and waits. Our process
scanner sees two PIDs for every gateway: launcher + interpreter, same
cmdline. Bug A masked this by accidentally counting the status call
AS one of them; with Bug A fixed, we see both the real launcher and
real interpreter for the gateway process itself.
Fix: `_filter_venv_launcher_stubs` at the tail of `_scan_gateway_pids`
walks each matched PID's ppid via psutil. Any PID that's the PARENT
of another matched PID is a launcher stub — drop it, keep the child.
Scoped to Windows (`is_windows() and len(pids) > 1`) and no-ops when
psutil isn't importable.
Net effect: `gateway status` now reports one PID per gateway — the
interpreter — matching POSIX behaviour and user expectations.
### 2. `install.ps1`: bootstrap pip + auto-install platform SDKs
New `Install-PlatformSdks` function wired between `Invoke-SetupWizard`
and `Start-GatewayIfConfigured`. Fixes two related issues on fresh
Windows installs:
1. The tiered `uv pip install` cascade (introduced in 87fca8342)
correctly falls through when tier 1 `.[all]` fails on the RL git
deps, but the fallback tiers can silently skip SDKs from `[messaging]`
when there's a partial-resolve. Result: user sets `DISCORD_BOT_TOKEN`
in `.env`, fires up gateway, hits "discord module not installed".
2. `uv` creates venvs WITHOUT pip by default, so the user's escape
hatch (`pip install discord.py` in the venv) doesn't exist either.
The new function:
- Skips if `-NoVenv` (nothing to bootstrap into).
- Scans `~/.hermes/.env` for messaging tokens (TELEGRAM_BOT_TOKEN,
DISCORD_BOT_TOKEN, SLACK_BOT_TOKEN, SLACK_APP_TOKEN, WHATSAPP_ENABLED),
filtering placeholder values.
- For each token that's set, runs `python -c "import <sdk>"` to verify.
- If any import fails: runs `python -m ensurepip --upgrade` to bootstrap
pip into the venv (idempotent — no-ops if pip is already present),
then `pip install <spec>` for each missing SDK with specs mirroring
pyproject.toml's `[messaging]` extra to avoid version drift.
The `$ErrorActionPreference = "SilentlyContinue"` spans are not
cosmetic — PowerShell wraps native-stderr from a non-zero-exit
subprocess as a `NativeCommandError` that prints even through
`*> $null` / `2>$null`. Save + restore EAP over the import-probe
and pip-install blocks keeps the output clean.
Verified on this Windows 10 box:
- Initial state: telegram+fastapi+psutil present, discord+slack_sdk
missing (tier 1 `.[all]` had failed — `.tirith-install-failed`
marker in `%LOCALAPPDATA%\\hermes`).
- First run with discord+slack tokens in .env: detects both missing,
ensurepip (skipped — pip was already bootstrapped earlier this
session for telegram), installs `discord.py[voice]==2.7.1` +
`PyNaCl` + `davey`, installs `slack-sdk==3.41.0`. All imports
succeed on verify.
- Second run: all three SDKs report OK, function no-ops.
Pip spec strings mirror pyproject.toml's `[messaging]` extra verbatim
so a bump to the extra picks up here automatically — no drift.
### Files
- `hermes_cli/gateway.py`: `_get_parent_pid` rewritten (psutil-first);
`_filter_venv_launcher_stubs` added; `_scan_gateway_pids` dedups
launchers on Windows when it finds >1 match.
- `scripts/install.ps1`: new `Install-PlatformSdks` function (~85
lines); wired into the main flow at line 1438.
### Verification
- `venv/Scripts/python.exe scripts/check-windows-footguns.py --all`
→ `✓ No Windows footguns found (380 file(s) scanned).`
- `ast.parse` passes on gateway.py.
- `[System.Management.Automation.Language.Parser]::ParseFile` passes
on install.ps1.
- Live gateway (PID 21952, running since 12:33 today) survived 5x
stress loop of `hermes gateway status` without dying.
## Why
Hermes supports Linux, macOS, and native Windows, but the codebase grew up
POSIX-first and has accumulated patterns that silently break (or worse,
silently kill!) on Windows:
- `os.kill(pid, 0)` as a liveness probe — on Windows this maps to
CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console
process group (bpo-14484, open since 2012).
- `os.killpg` — doesn't exist on Windows at all (AttributeError).
- `os.setsid` / `os.getuid` / `os.geteuid` — same.
- `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr
errors at runtime on Windows.
- `open(path)` / `open(path, "r")` without explicit encoding= — inherits
the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX),
causing mojibake round-tripping between hosts.
- `wmic` — removed from Windows 10 21H1+.
This commit does three things:
1. Makes `psutil` a core dependency and migrates critical callsites to it.
2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that
blocks new instances of any of the above patterns.
3. Fixes every existing instance in the codebase so the baseline is clean.
## What changed
### 1. psutil as a core dependency (pyproject.toml)
Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical
cross-platform answer for "is this PID alive" and "kill this process
tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on
Windows (NOT a signal call), and its `Process.children(recursive=True)`
+ `.kill()` combo replaces `os.killpg()` portably.
### 2. `gateway/status.py::_pid_exists`
Rewrote to call `psutil.pid_exists()` first, falling back to the
hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows
(and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing —
e.g. during the scaffold phase of a fresh install before pip finishes.
### 3. `os.killpg` migration to psutil (7 callsites, 5 files)
- `tools/code_execution_tool.py`
- `tools/process_registry.py`
- `tools/tts_tool.py`
- `tools/environments/local.py` (3 sites kept as-is, suppressed with
`# windows-footgun: ok` — the pgid semantics psutil can't replicate,
and the calls are already Windows-guarded at the outer branch)
- `gateway/platforms/whatsapp.py`
### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines)
Grep-based checker with 11 rules covering every Windows cross-platform
footgun we've hit so far:
1. `os.kill(pid, 0)` — the silent killer
2. `os.setsid` without guard
3. `os.killpg` (recommends psutil)
4. `os.getuid` / `os.geteuid` / `os.getgid`
5. `os.fork`
6. `signal.SIGKILL`
7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT`
8. `subprocess` shebang script invocation
9. `wmic` without `shutil.which` guard
10. Hardcoded `~/Desktop` (OneDrive trap)
11. `asyncio.add_signal_handler` without try/except
12. `open()` without `encoding=` on text mode
Features:
- Triple-quoted-docstring aware (won't flag prose inside docstrings)
- Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments)
- Guard-hint aware (skips lines with `hasattr(os, ...)`,
`shutil.which(...)`, `if platform.system() != 'Windows'`, etc.)
- Inline suppression with `# windows-footgun: ok — <reason>`
- `--list` to print all rules with fixes
- `--all` / `--diff <ref>` / staged-files (default) modes
- Scans 380 files in under 2 seconds
### 5. CI integration
A GitHub Actions workflow that runs the checker on every PR and push is
staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this
commit because the GH token on the push machine lacks `workflow` scope.
A maintainer with `workflow` permissions should add it as
`.github/workflows/windows-footguns.yml` in a follow-up. Content:
```yaml
name: Windows footgun check
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
check:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: {python-version: "3.11"}
- run: python scripts/check-windows-footguns.py --all
```
### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion
Expanded from 5 to 16 rules, each with message, example, and fix.
Recommends psutil as the preferred API for PID / process-tree operations.
### 7. Baseline cleanup (91 → 0 findings)
- 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or
`encoding='utf-8-sig'` (user-editable files that Notepad may BOM)
- 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin
tool subprocess management → annotated with
`# windows-footgun: ok — <reason>`
- 7 `os.killpg` sites → migrated to psutil (see §3 above)
## Verification
```
$ python scripts/check-windows-footguns.py --all
✓ No Windows footguns found (380 file(s) scanned).
$ python -c "from gateway.status import _pid_exists; import os
> print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))"
self: True
bogus: False
```
Proof-of-repro that `os.kill(pid, 0)` was actually killing processes
before this fix — see commit `1cbe39914` and bpo-14484. This commit
removes the last hand-rolled ctypes path from the hot liveness-check
path and defers to the best-maintained cross-platform answer.
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).
This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook fire,
no exit-diag entry — just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom because `os.kill(pid, 0)` fires
mid-response-send and kills the gateway between logs.
Reproduction (verified in this branch before the fix):
$ hermes gateway start # gateway alive, PID 37520
$ hermes gateway status # reports "No gateway process detected"
$ tasklist /FI "PID eq 37520" # INFO: No tasks are running
# — gateway terminated silently
Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:
- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
via ctypes. Zero signal delivery, zero console-group side effects.
Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
(PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is
a no-op there.
Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. Total 14 callsites across
11 files; every single one was a latent silent-kill on Windows:
gateway/run.py:2810 — /restart watcher (inline subprocess)
gateway/run.py:15195 — --replace wait loop
gateway/status.py:572 — acquire_gateway_runtime_lock stale check
gateway/status.py:828 — get_running_pid (THE killer for status)
gateway/platforms/whatsapp.py:111
hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops
hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to
be cross-platform but used
os.kill(pid, 0) on Windows
hermes_cli/main.py:5792 — CLI process-kill polling
hermes_cli/profiles.py:782 — profile stop wait loop
plugins/google_meet/process_manager.py:74
tools/browser_tool.py:1215, 1255 — browser daemon ownership probes
tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking
The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.
Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
entries; cron ticker and kanban dispatcher still running on
their 60-second cadence
- bot continues answering Telegram messages throughout
Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.
Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.
Hermes gateway now installs as a real Windows service via
`hermes gateway install`, auto-starts on user logon, and stays running
across reboots. Mirrors the launchd (macOS) / systemd (Linux) contract
so the rest of the CLI dispatcher just plugs into the same `install /
uninstall / start / stop / restart / status` entrypoints.
Primary implementation is the new `hermes_cli/gateway_windows.py`:
- `schtasks /Create /SC ONLOGON /RL LIMITED /RU <user> /NP /IT` creates
a per-user Scheduled Task running as the current user at next logon,
with no UAC prompt and no stored password. Same pattern OpenClaw uses.
- When `schtasks /Create` returns "Access is denied" or times out
(locked-down corporate boxes, 15s/30s hard + no-output cutoffs),
fall back to writing a `.cmd` file into
`%APPDATA%\Microsoft\Windows\Start Menu\Programs\Startup\`, which
Windows Explorer fires at every logon. Either path produces the same
end-user experience.
- `_spawn_detached()` launches `pythonw.exe -m hermes_cli.main gateway
run --replace` directly with `DETACHED_PROCESS |
CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW |
CREATE_BREAKAWAY_FROM_JOB` + DEVNULL stdio + sidecar
`logs/gateway-stdio.log`. Going through pythonw.exe (no console)
instead of a cmd.exe shim is what lets the gateway survive the
spawning shell's exit on Windows — documented in
`references/windows-subprocess-sigint-storm.md`.
- Two separate quoting helpers for cmd.exe vs schtasks (`/TR` argument)
— they're different parsers and mixing breaks both. Same split
OpenClaw documents in src/daemon/schtasks.ts.
- `_wait_for_gateway_ready()` + `_report_gateway_start()` poll for a
live gateway process after spawn and report the PID, so install
doesn't lie about success.
Dispatcher wiring in `hermes_cli/gateway.py`:
- `_gateway_command_inner()` gets Windows branches for install /
uninstall / start / stop / restart / status + `_is_service_installed`
+ `_is_service_running`. `gateway status` output + suggested
commands now mention `hermes gateway install` instead of
`sudo hermes gateway install --system` on Windows.
Two separable Windows fixes that only matter for a working
detached gateway, bundled here because shipping them independently
leaves install broken:
(1) Spurious CTRL_C_EVENT on detached pythonw runs. When the gateway
is launched detached on Windows, something on the boot path (HTTPX /
python-telegram-bot / asyncio ProactorEventLoop subprocess plumbing)
synthesizes a Ctrl+C within ~60-90 seconds. Python 3.11 translates it
into KeyboardInterrupt inside `asyncio.run(start_gateway(...))`, the
outer `except KeyboardInterrupt: return` exits cleanly, and the
process dies with no shutdown log — "bot started typing, then
stopped" is the fingerprint because the interrupt fires mid-send.
Fix in `run_gateway()`: when `is_windows()` and stdin is not a TTY,
install `signal.signal(SIGINT, SIG_IGN)` + same for SIGBREAK. Real
console runs have a TTY and skip the absorber, so user Ctrl+C still
works interactively. Same family as commit 449ad952b's browser-tool
SIGINT absorber; cross-referenced in the ref doc.
(2) `wmic process get` is the process-list path used by
`_scan_gateway_pids()` / `find_gateway_pids()`, which power status,
stop, and restart on Windows. `C:\Windows\System32\wbem\WMIC.exe` has
been deprecated since Windows 10 21H1 and is not installed on modern
Win 10/11 boxes, so `find_gateway_pids()` silently returns [] — status
sees no gateway even when one is running. Fix: `shutil.which("wmic")`
first, fall back to PowerShell's `Get-CimInstance Win32_Process`
emitting the same LIST-style `CommandLine=...` / `ProcessId=...` pairs
the downstream parser already handles. Zero behavior change on boxes
where wmic still works.
Verified end-to-end on Windows 10 (Delta-1):
- `hermes gateway install` → falls back to Startup folder (access
denied on schtasks for this user) + detached pythonw spawn, PID
reported correctly.
- Gateway connects to Telegram, answers messages, stays alive past
2min (previously died at ~85s with no shutdown log).
- `hermes gateway stop` + `uninstall` both clean up both tracks.
Refs: openclaw/openclaw src/daemon/schtasks.ts for the ONLOGON +
startup-folder-fallback pattern. skill hermes-agent
references/windows-subprocess-sigint-storm.md for the deeper
CTRL_C_EVENT / ProactorEventLoop background.
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.
Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.
## New module
- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
All no-ops on non-Windows.
## CRITICAL fixes (would crash or silently break on Windows)
- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
AttributeError on import on Windows, breaking `hermes --tui` entirely (it
spawns this module as a subprocess). Guard each signal.signal() call with
hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.
- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop
behind `os.name != "nt"` — Windows has no zombies anyway.
- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0
ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts
either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
Generated sandbox client parses both.
- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use
shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
readable error when bash is genuinely absent.
- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
(npm install + node version probe), browser_tool.py x2. On Windows npm
is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
fails with WinError 193. shutil.which(...) returns the absolute .cmd
path which CreateProcessW accepts because the extension routes through
cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the
same path subprocess would resolve itself).
## HIGH fixes (silent misbehaviour on Windows)
- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd
tracking silently broken — `cd` in terminal tool did nothing. Windows
branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
(works in both bash and Python, guaranteed no spaces).
- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
in split(":")` heuristic mangles Windows PATH (";" separator). Gate
the injection behind `not _IS_WINDOWS`.
- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
Popen + watcher-script Popen both used start_new_session=True, which
Windows silently ignores. Watcher stayed attached to CLI's console,
died when user closed terminal after `hermes update`, left gateway
stale. Now branches through windows_detach_popen_kwargs() helper
(CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
Windows, start_new_session=True on POSIX — identical to main).
## MEDIUM fixes
- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
chain crashes on Windows when user triggers /update in-gateway. Now
has sys.platform=="win32" branch using sys.executable + a tiny
Python watcher with proper detach flags. POSIX path is unchanged.
- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
style paths that break subprocess.Popen(cwd=...) and Path().resolve().
Added _normalize_git_bash_path() helper that translates /c/Users,
/cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op.
_git_repo_root() now routes every result through it.
- cli.py worktree .worktreeinclude: os.symlink on directories failed
hard on Windows (requires admin or Developer Mode). Falls back to
shutil.copytree with a warning log.
## Tests
- 29 new tests in tests/tools/test_windows_native_support.py covering:
subprocess_compat helpers, TUI entry signal guards, kanban waitpid
guard, code_execution TCP fallback source-level invariants, cron bash
resolution, npm/npx bare-spawn lint per-file, local env Windows temp
dir, PATH injection gating, git bash path normalization, symlink
fallback, gateway detached watcher flags.
- One existing test assertion adjusted in test_browser_homebrew_paths:
it compared captured Popen argv to the BARE `"npx"` literal; after the
shutil.which() change argv[0] is the absolute path. New assertion
checks the shape (two items, second is `agent-browser`) rather than
the exact first-item string. Behaviour unchanged; test was too strict.
All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.
## What's still deferred (LOW priority)
- Visible cmd-window flashes on short-lived console apps (~14 sites) —
cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
reachable only when all env-var candidates fail.
Native Windows (with Git for Windows installed) can now run the Hermes CLI
and gateway end-to-end without crashing. install.ps1 already existed and
the Git Bash terminal backend was already wired up — this PR fills the
remaining gaps discovered by auditing every Windows-unsafe primitive
(`signal.SIGKILL`, `os.kill(pid, 0)` probes, bare `fcntl`/`termios`
imports) and by comparing hermes against how Claude Code, OpenCode, Codex,
and Cline handle native Windows.
## What changed
### UTF-8 stdio (new module)
- `hermes_cli/stdio.py` — single `configure_windows_stdio()` entry point.
Flips the console code page to CP_UTF8 (65001), reconfigures
`sys.stdout`/`stderr`/`stdin` to UTF-8, sets `PYTHONIOENCODING` + `PYTHONUTF8`
for subprocesses. No-op on non-Windows. Opt out via `HERMES_DISABLE_WINDOWS_UTF8=1`.
- Called early in `cli.py::main`, `hermes_cli/main.py::main`, and
`gateway/run.py::main` so Unicode banners (box-drawing, geometric
symbols, non-Latin chat text) don't `UnicodeEncodeError` on cp1252
consoles.
### Crash sites fixed
- `hermes_cli/main.py:7970` (hermes update → stuck gateway sweep): raw
`os.kill(pid, _signal.SIGKILL)` → `gateway.status.terminate_pid(pid, force=True)`
which routes through `taskkill /T /F` on Windows.
- `hermes_cli/profiles.py::_stop_gateway_process`: same fix — also
converted SIGTERM path to `terminate_pid()` and widened OSError catch
on the intermediate `os.kill(pid, 0)` probe.
- `hermes_cli/kanban_db.py:2914, 3041`: raw `signal.SIGKILL` →
`getattr(signal, "SIGKILL", signal.SIGTERM)` fallback (matches the
pattern already used in `gateway/status.py`).
### OSError widening on `os.kill(pid, 0)` probes
Windows raises `OSError` (WinError 87) for a gone PID instead of
`ProcessLookupError`. Widened the catch at:
- `gateway/run.py:15101` (`--replace` wait-for-exit loop — without this,
the loop busy-spins the full 10s every Windows gateway start)
- `hermes_cli/gateway.py:228, 460, 940`
- `hermes_cli/profiles.py:777`
- `tools/process_registry.py::_is_host_pid_alive`
- `tools/browser_tool.py:1170, 1206`
### Dashboard PTY graceful degradation
`hermes_cli/pty_bridge.py` depends on `fcntl`/`termios`/`ptyprocess`,
none of which exist on native Windows. Previously a Windows dashboard
would crash on `import hermes_cli.web_server` because of a top-level
import. Now:
- `hermes_cli/web_server.py` wraps the pty_bridge import in
`try/except ImportError` and sets `_PTY_BRIDGE_AVAILABLE=False`.
- The `/api/pty` WebSocket handler returns a friendly "use WSL2 for
this tab" message instead of exploding.
- Every other dashboard feature (sessions, jobs, metrics, config
editor) runs natively on Windows.
### Dependency
- `pyproject.toml`: add `tzdata>=2023.3; sys_platform == 'win32'` so
Python's `zoneinfo` works on Windows (which has no IANA tzdata
shipped with the OS). Credits @sprmn24 (PR #13182).
### Docs
- README.md: removed "Native Windows is not supported"; added
PowerShell one-liner and Git-for-Windows prerequisite note.
- `website/docs/getting-started/installation.md`: new Windows section
with capability matrix (everything native except the dashboard
`/chat` PTY tab, which is WSL2-only).
- `website/docs/user-guide/windows-wsl-quickstart.md`: reframed as
"WSL2 as an alternative to native" rather than "the only way".
- `website/docs/developer-guide/contributing.md`: updated
cross-platform guidance with the `signal.SIGKILL` / `OSError`
rules we enforce now.
- `website/docs/user-guide/features/web-dashboard.md`: acknowledged
native Windows works for everything except the embedded PTY pane.
## Why this shape
Pulled from a survey of how other agent codebases handle native
Windows (Claude Code, OpenCode, Codex, Cline):
- All four treat Git Bash as the canonical shell on Windows, same as
hermes already does in `tools/environments/local.py::_find_bash()`.
- None of them force `SetConsoleOutputCP` — but they don't have to,
Node/Rust write UTF-16 to the Win32 console API. Python does not get
that for free, so we flip CP_UTF8 via ctypes.
- None of them ship PowerShell-as-primary-shell (Claude Code exposes
PS as a secondary tool; scope creep for this PR).
- All of them use `taskkill /T /F` for force-kill on Windows, which
is exactly what `gateway.status.terminate_pid(force=True)` does.
## Non-goals (deliberate scope limits)
- No PowerShell-as-a-second-shell tool — worth designing separately.
- No terminal routing rewrite (#12317, #15461, #19800 cluster) — that's
the hardest design call and needs a separate doc.
- No wholesale `open()` → `open(..., encoding="utf-8")` sweep (Tianworld
cluster) — will do as follow-up if users hit actual breakage; most
modern code already specifies it.
## Validation
- 28 new tests in `tests/tools/test_windows_native_support.py` — all
platform-mocked, pass on Linux CI. Cover:
- `configure_windows_stdio` idempotency, opt-out, env-preservation
- `terminate_pid` taskkill routing, failure → OSError, FileNotFoundError fallback
- `getattr(signal, "SIGKILL", …)` fallback shape
- `_is_host_pid_alive` OSError widening (Windows-gone-PID behavior)
- Source-level checks that all entry points call `configure_windows_stdio`
- pty_bridge import-guard present in `web_server.py`
- README no longer says "not supported"
- 12 pre-existing tests in `tests/tools/test_windows_compat.py` still pass.
- `tests/hermes_cli/` ran fully (3909 passed, 9 failures — all confirmed
pre-existing on main by stash-test).
- `tests/gateway/` ran fully (5021 passed, 1 pre-existing failure).
- `tests/tools/test_process_registry.py` + `test_browser_*` pass.
- Manual smoke: `import hermes_cli.stdio; import gateway.run;
import hermes_cli.web_server` — all clean, `_PTY_BRIDGE_AVAILABLE=True`
on Linux (as expected).
## Files
- New: `hermes_cli/stdio.py`, `tests/tools/test_windows_native_support.py`
- Modified: `cli.py`, `gateway/run.py`, `hermes_cli/main.py`,
`hermes_cli/profiles.py`, `hermes_cli/gateway.py`,
`hermes_cli/kanban_db.py`, `hermes_cli/pty_bridge.py`,
`hermes_cli/web_server.py`, `tools/browser_tool.py`,
`tools/process_registry.py`, `pyproject.toml`, `README.md`, and 4
docs pages.
Credits to everyone whose prior PR work informed these fixes — see
the co-author trailers. All of the PRs listed in
`~/.hermes/plans/windows-support-prs.md` fixing `os.kill` / `signal.SIGKILL`
/ UTF-8 stdio / tzdata / README patterns found the same issues; this PR
consolidates them.
Co-authored-by: Philip D'Souza <9472774+PhilipAD@users.noreply.github.com>
Co-authored-by: Arecanon <42595053+ArecaNon@users.noreply.github.com>
Co-authored-by: XiaoXiao0221 <263113677+XiaoXiao0221@users.noreply.github.com>
Co-authored-by: Lars Hagen <1360677+lars-hagen@users.noreply.github.com>
Co-authored-by: Luan Dias <65574834+luandiasrj@users.noreply.github.com>
Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com>
Co-authored-by: sprmn24 <oncuevtv@gmail.com>
Co-authored-by: adybag14-cyber <252811164+adybag14-cyber@users.noreply.github.com>
Co-authored-by: Prasanna28Devadiga <54196612+Prasanna28Devadiga@users.noreply.github.com>
Add a new `hermes gateway list` subcommand that shows the running
status of gateways across all profiles in a single view:
Gateways:
✓ default (current) — PID 155469
✓ wx1 — PID 166893
✗ dev — not running
Also includes `_print_other_profiles_gateway_status()` which appends
an "Other profiles" section to `hermes gateway status` output when
other profile gateways are running.
Both use existing `list_profiles()` and `find_profile_gateway_processes()`
— no new dependencies.
Closes#19127
Related: #19113, #4402, #4587
Widen PR #20314's fix to the other timeout-polling sites in the codebase
that share the same wall-clock-jump bug class. All of these measure elapsed
timeout duration, not civil time, so they belong on time.monotonic().
- hermes_cli/auth.py: auth-store file-lock timeout, Spotify OAuth callback
wait, Nous portal device-auth token poll.
- hermes_cli/copilot_auth.py: Copilot OAuth device-flow token poll.
- hermes_cli/gateway.py: gateway systemd restart wait.
- hermes_cli/web_server.py: dashboard Codex device-auth user_code wait,
dashboard Nous device-auth token poll. (sess["expires_at"] stays on
time.time() — it's a persisted absolute timestamp, not a local
deadline-polling variable.)
- agent/copilot_acp_client.py: Copilot ACP JSON-RPC request timeout.
The setup wizard dropped non-root users at a bare shell prompt when
trying to start a system-scope gateway service. Previously
_require_root_for_system_service called sys.exit(1), which the
wizard's `except Exception` guards cannot catch (SystemExit is a
BaseException). Users with a pre-existing /etc/systemd/system unit
(e.g. from an earlier `sudo hermes setup` run) hit this whenever
they re-ran `hermes setup` as a regular user.
- Convert _require_root_for_system_service to raise a typed
SystemScopeRequiresRootError (RuntimeError subclass) instead of
sys.exit(1). The direct CLI path (`hermes gateway install|start|stop|
restart|uninstall` without sudo) still exits 1 cleanly via a new
catch at the top of gateway_command, matching the existing
UserSystemdUnavailableError pattern.
- Add _system_scope_wizard_would_need_root() pre-check and
_print_system_scope_remediation() helper. Both setup wizards
(hermes_cli/setup.py and hermes_cli/gateway.py::gateway_setup) now
detect the dead-end before prompting and print actionable guidance:
either `sudo systemctl start <service>` this time, or uninstall the
system unit and install a per-user one.
- Defense-in-depth: all 5 wizard prompt sites also catch
SystemScopeRequiresRootError and fall back to the remediation
helper if the pre-check is bypassed (race, etc.).
Tests: 12 new tests in TestSystemScopeRequiresRootError,
TestSystemScopeWizardPreCheck, TestSystemScopeRemediationOutput, and
TestGatewayCommandCatchesSystemScopeError covering the exception
contract, pre-check matrix (root vs non-root, system-only vs
user-present vs none vs explicit system=True), remediation output
for each action, and the direct-CLI exit-1 path.
Instead of an unhelpful CalledProcessError traceback when running
`hermes gateway start/stop/restart` without first installing the service,
check for the unit file and exit with an actionable install hint.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The resilient restart settings from PR #18639 only took effect when
the gateway was started via `hermes gateway start` or `hermes gateway
restart` — both of which call refresh_systemd_unit_if_needed() which
writes the new unit and runs daemon-reload.
However, when the gateway self-restarts via exit-code-75 (stale-code
detection after `hermes update`, or the /restart command), systemd
respawns the process directly without going through any CLI function.
The unit file on disk stays stale, and systemd keeps using the old
cached settings (StartLimitBurst=5, RestartSec=30) until someone
manually runs `hermes gateway restart`.
This meant that after PR #18639 was deployed, users who never ran
`hermes gateway restart` manually were still vulnerable to the
permanent-death-on-network-outage bug.
Fix: call refresh_systemd_unit_if_needed() at the top of run_gateway()
(the foreground entry point that systemd's ExecStart invokes). This
ensures that on every boot — whether triggered by systemd restart,
exit-75 respawn, or manual foreground run — the unit definition and
daemon state are current. The call is best-effort (exceptions caught)
and a no-op when the unit is already current (one stat + string compare).
_scan_gateway_pids() uses ps-based pattern matching to find running
gateways. When invoked from the CLI (e.g. `hermes gateway status`),
the calling process itself matches gateway patterns, causing false
positives — the CLI is mistakenly counted as a running gateway.
Add _get_ancestor_pids() that walks the process tree from the current
PID up to init (PID 1). Merge this set into exclude_pids at the top
of _scan_gateway_pids() so the entire ancestor chain is filtered out.
This complements the existing os.getpid() exclusion in
_append_unique_pid() by also covering parent/grandparent processes
(e.g. when hermes is invoked via a wrapper script or shell).
Closes#13242
When multiple gateway profiles are running (e.g. default and wx1),
`hermes gateway status` can be misleading — stopping one profile's
gateway and checking status may still show the other profile's process
without indicating which profile it belongs to.
Add `_print_other_profiles_gateway_status()` which displays running
gateways from other profiles at the bottom of the status output:
Other profiles:
✓ wx1 — PID 166893
This uses the existing `find_profile_gateway_processes()` and
`get_active_profile_name()` — no new dependencies.
Closes#19113
Related: #4402, #4587
The old defaults (StartLimitIntervalSec=600, StartLimitBurst=5,
RestartSec=30) meant any network outage over ~5 minutes would
permanently kill the gateway until manual intervention.
Changes:
- StartLimitIntervalSec=0 (never give up)
- Restart=always (not just on-failure)
- RestartSec=60 with RestartMaxDelaySec=300, RestartSteps=5
(exponential backoff: 60 → 120 → 180 → 240 → 300s cap)
- After=network-online.target + Wants= (both units now wait for
actual connectivity, not just network.target)
Power outage → internet down → internet back = auto-recovery.
Platform plugins shipped in-repo under plugins/platforms/ should be
available out of the box — users shouldn't have to add 'irc-platform'
to plugins.enabled before they can pick IRC from the gateway setup menu.
Adds a new ``kind: platform`` plugin type that mirrors the existing
``kind: backend`` auto-load semantics:
- Bundled (shipped in the hermes-agent repo): auto-load unconditionally.
- User-installed (~/.hermes/plugins/): still opt-in via plugins.enabled
so untrusted code doesn't silently run.
Changes:
* hermes_cli/plugins.py: add 'platform' to _VALID_PLUGIN_KINDS, document
the new kind in the PluginManifest docstring, extend the bundled auto-
load rule from 'backend only' to 'backend or platform'.
* plugins/platforms/irc/plugin.yaml: declare kind: platform.
* hermes_cli/gateway.py: remove the now-redundant
_load_bundled_platform_plugins_for_enumeration() helper and the
_enable_plugin_for_platform() helper. The setup menu's _all_platforms()
just calls discover_plugins() and reads the registry — bundled
platforms are already loaded at that point. Drops the 'needs_enable'
flag and the 'plugin disabled — select to enable' status string.
* hermes_cli/setup.py: relax the "gateway is configured" detector used
during OpenClaw migration. Switching to _platform_status() in an
earlier commit tightened the check to require an exact "configured"
match, dropping platforms whose status is "enabled, not paired",
"partially configured", "configured + E2EE", etc. Now any non-"not
configured" status counts — the user has already started setup there
and we shouldn't force the section to rerun.
* tests/hermes_cli/test_setup_irc.py: drop the TestIRCPluginDisabledFlow
class and test_configure_platform_enables_disabled_plugin_first — the
no-longer-existent flow they were testing.
* tests/hermes_cli/test_setup_openclaw_migration.py: patch both
setup.get_env_value and gateway.get_env_value in the 4 gateway-section
tests that reach _platform_status() through the unified setup flow;
switch WHATSAPP_ENABLED to the literal "true" in the registry-parity
test so WhatsApp's value-shape validator matches.
Verified via fresh-install smoke (empty plugins.enabled, no env vars):
IRC plugin loads, Platform('irc') resolves, _all_platforms() lists IRC
with status 'not configured'. 160 targeted tests pass.
feat(gateway): refine Platform._missing_ and platform-connected dispatch
Restricts plugin-name acceptance to bundled plugin scan + registry
(no arbitrary string -> enum-pollution), pulls per-platform connectivity
checks into a _PLATFORM_CONNECTED_CHECKERS lambda map with a clean
_is_platform_connected method, and adds tests covering the checker map,
plugin platform interface, and IRC setup wizard.
Merge the two gateway setup paths (hermes setup gateway + hermes gateway
setup) to use a single _unified_platforms() list that merges built-in
_PLATFORMS with dynamically registered plugin entries from
platform_registry.
- Add setup_fn field to PlatformEntry for plugin setup flows
- _unified_platforms() merges built-ins with registry entries by key
- setup_gateway() now uses unified list instead of hardcoded
_GATEWAY_PLATFORMS tuple list
- gateway_setup() uses same unified list, plugin entries appear
alongside built-ins with no [plugin] suffix
- _platform_status() handles plugin platforms via registry check_fn
- Plugin platforms with setup_fn get called directly; plugins without
get a generic env-var display fallback
IRC and other plugin platforms now appear automatically in the setup
menu when registered via platform_registry.register().
feat(gateway): surface disabled platform plugins in setup and auto-enable on select
Platform plugins under plugins/platforms/* (IRC, etc.) were gated behind
plugins.enabled, so `hermes gateway setup` wouldn't list them until the
user ran `hermes plugins enable <name>` first. Now the setup menu always
surfaces them as "plugin disabled — select to enable", and picking one
adds it to plugins.enabled before running its setup flow.
Along the way, unify the two gateway setup flows so `hermes setup gateway`
and `hermes gateway setup` both read from the same platform list (built-in
_PLATFORMS + platform_registry entries), dispatch through a single
_configure_platform() helper, and share _platform_status(). Deletes the
dead bespoke wrappers in setup.py (_setup_whatsapp, _setup_weixin,
_setup_email, etc.) that duplicated logic now covered by the registry
path or _setup_standard_platform.
Also:
- PlatformEntry gains a plugin_name field so the registry knows which
plugin owns each entry (required for auto-enable).
- PluginContext.register_platform auto-stamps plugin_name from the
manifest so plugins don't have to pass it explicitly.
- PluginManager now scans plugins/platforms/* as its own category root,
one level below the bundled plugin scan.
- Fix IRC plugin discovery: rename PLUGIN.yaml → plugin.yaml (the
scanner is case-sensitive) and add the missing __init__.py that
_load_directory_module requires.
Extends the platform plugin interface from Phase 1 to cover every
touchpoint where built-in platforms have hardcoded behavior.
- allowed_users_env / allow_all_env: per-platform auth env vars
- max_message_length: smart-chunking for send_message tool
- pii_safe: session PII redaction flag
- emoji: CLI/gateway display
- allow_update_command: /update access control
send_message tool (tools/send_message_tool.py):
- Replaced hardcoded platform_map dict with Platform() call
- Added _send_via_adapter() for plugin platforms — routes through
live gateway adapter when available
- Registry-aware max message length for smart chunking
Cron delivery (cron/scheduler.py):
- Replaced hardcoded 15-entry platform_map with Platform() call
- Plugin platforms now work as cron delivery targets
User authorization (gateway/run.py _is_user_authorized):
- Registry fallback: checks PlatformEntry.allowed_users_env and
allow_all_env when platform not in hardcoded maps
- Plugin platforms get per-platform auth support
_UPDATE_ALLOWED_PLATFORMS: checks registry allow_update_command flag
Channel directory: includes plugin platforms in session enumeration
Orphaned config warning: descriptive message when plugin platform is
in config but no plugin registered it
Gateway weakref: _gateway_runner_ref for cross-module adapter access
hermes status: shows plugin platforms with (plugin) tag
hermes gateway setup: plugin platforms appear in menu with setup hints
hermes_cli/platforms.py: get_all_platforms() merges with registry,
platform_label() falls back to registry for plugin names
- 8 new tests (extended fields, cron resolution, platforms merge)
- Updated 3 tests for new Platform() based resolution
- 2829 passed, 24 pre-existing failures, zero new failures