The /model picker called provider_model_ids() which fetches the FULL
live API catalog (hundreds of models for Anthropic, Copilot, etc.) and
only fell back to the curated list when the live fetch failed.
This flips the priority: use the curated model list from
list_authenticated_providers() (same lists as `hermes model` and
gateway pickers), falling back to provider_model_ids() only when the
curated list is empty (e.g. user-defined endpoints).
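A minimal sketch of the new priority order, assuming the curated list and the live fetch are passed in by the caller. `models_for_picker` and its parameters are hypothetical names for illustration, not the real picker code:

```python
from __future__ import annotations

from typing import Callable, Sequence


def models_for_picker(
    curated: Sequence[str],
    fetch_live: Callable[[], Sequence[str]],
) -> list[str]:
    """Return the curated list, calling the live catalog only if it is empty."""
    if curated:
        return list(curated)
    try:
        return list(fetch_live())
    except Exception:
        # Assumption: a failed live fetch yields an empty picker, not a crash.
        return []
```

The live catalog is never contacted when a curated list exists, which also avoids the network round trip on the common path.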
hermes doctor now checks whether the ~/.local/bin/hermes symlink exists
and points to the correct venv entry point. With --fix, it creates or
repairs the symlink automatically.
Covers:
- Missing symlink at ~/.local/bin/hermes (or $PREFIX/bin on Termux)
- Symlink pointing to wrong target
- Missing venv entry point (venv/bin/hermes or .venv/bin/hermes)
- PATH warning when ~/.local/bin is not on PATH
- Skipped on Windows (different mechanism)
Addresses user report: 'python -m hermes_cli.main doesn't have an option
to fix the local bin/install'
10 new tests covering all scenarios.
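The check-and-repair flow could look roughly like the sketch below. `check_symlink` and its return strings are hypothetical; the real doctor check also handles Termux, PATH warnings, and Windows, which are omitted here:

```python
from pathlib import Path


def check_symlink(link: Path, target: Path, fix: bool = False) -> str:
    """Return 'ok', 'missing', 'wrong-target', or 'fixed'."""
    if link.is_symlink() and link.resolve() == target.resolve():
        return "ok"
    # A broken symlink has exists() == False, so test is_symlink() as well.
    present = link.is_symlink() or link.exists()
    status = "wrong-target" if present else "missing"
    if not fix:
        return status
    link.parent.mkdir(parents=True, exist_ok=True)
    if present:
        link.unlink()
    link.symlink_to(target)
    return "fixed"
```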
On some Python versions, argparse fails to route subcommand tokens when
the parent parser has nargs='?' optional arguments (--continue). The
symptom: 'hermes model' produces 'unrecognized arguments: model' even
though 'model' is a registered subcommand.
Fix: when argv contains a token matching a known subcommand, set
subparsers.required=True to force deterministic routing. If that fails
(e.g. 'hermes -c model' where 'model' is consumed as the session name
for --continue), fall back to the default optional-subparsers behaviour.
Adds 13 tests covering all key argument combinations.
Reported via user screenshot showing the exact error on an installed
version with the model subcommand listed in usage but rejected at parse
time.
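A sketch of the strict-then-fallback parse described above. The parser layout (a `-c/--continue` flag with `nargs='?'` and two subcommands) is a simplified assumption, and `parse_cli` is a hypothetical name:

```python
import argparse
import contextlib
import io


def parse_cli(argv):
    parser = argparse.ArgumentParser(prog="hermes")
    # Hypothetical flag shape: --continue takes an optional session name.
    parser.add_argument("-c", "--continue", dest="resume", nargs="?", const=True)
    subparsers = parser.add_subparsers(dest="command")
    for name in ("model", "chat"):
        subparsers.add_parser(name)

    if any(token in subparsers.choices for token in argv):
        subparsers.required = True
        try:
            # Swallow argparse's usage error from the strict attempt.
            with contextlib.redirect_stderr(io.StringIO()):
                return parser.parse_args(argv)
        except SystemExit:
            subparsers.required = False
    # Default behaviour: subcommand is optional.
    return parser.parse_args(argv)
```

With `['-c', 'model']` the strict attempt can fail because `model` is taken as the session name, and the second parse then accepts it with no subcommand.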
Four independent fixes:
1. Reset activity timestamp on cached agent reuse (#9051)
When the gateway reuses a cached AIAgent for a new turn, the
_last_activity_ts from the previous turn (possibly hours ago)
carried over. The inactivity timeout handler immediately saw
the agent as idle for hours and killed it.
Fix: reset _last_activity_ts, _last_activity_desc, and
_api_call_count when retrieving an agent from the cache.
2. Detect uv-managed virtual environments (#8620 sub-issue 1)
The systemd unit generator fell back to sys.executable (uv's
standalone Python) when running under 'uv run', because
sys.prefix == sys.base_prefix. The generated ExecStart pointed
to a Python binary without site-packages.
Fix: check VIRTUAL_ENV env var before falling back to
sys.executable. uv sets VIRTUAL_ENV even when sys.prefix
doesn't reflect the venv.
3. Nudge model to continue after empty post-tool response (#9400)
Weaker models sometimes return an empty response after tool calls. The agent
silently abandoned the remaining work.
Fix: append assistant('(empty)') + user nudge message and retry
once. Resets after each successful tool round.
4. Compression model fallback on permanent errors (#8620 sub-issue 4)
When the default summary model (gemini-3-flash) returns 503
'model_not_found' on custom proxies, the compressor entered a
600s cooldown, leaving context growing unbounded.
Fix: detect permanent model-not-found errors (503, 404,
'model_not_found', 'no available channel') and fall back to
using the main model for compression instead of entering
cooldown. One-time fallback with immediate retry.
Test plan: 40 compressor tests + 97 gateway/CLI tests + 9 venv tests pass
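For fix 2, the interpreter lookup order can be sketched as follows. `resolve_service_python` is a hypothetical helper, and the `bin/python` layout is a POSIX assumption (reasonable for systemd hosts):

```python
import os
import sys
from pathlib import Path


def resolve_service_python(environ=None, prefix=None, base_prefix=None, executable=None):
    """Pick the interpreter to put in ExecStart."""
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    executable = sys.executable if executable is None else executable

    # Classic venv: sys.prefix differs from sys.base_prefix.
    if prefix != base_prefix:
        return str(Path(prefix) / "bin" / "python")
    # uv run: prefixes match, but VIRTUAL_ENV still names the venv.
    venv = environ.get("VIRTUAL_ENV")
    if venv:
        candidate = Path(venv) / "bin" / "python"
        if candidate.exists():
            return str(candidate)
    return executable
```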
The existing recovery block sanitized self.api_key and
self._client_kwargs['api_key'] but did not update self.client.api_key.
The OpenAI SDK stores its own copy of api_key and reads it dynamically
via the auth_headers property on every request. Without this fix, the
retry after sanitization would still send the corrupted key in the
Authorization header, causing the same UnicodeEncodeError.
The bug manifests when an API key contains Unicode lookalike characters
(e.g. ʋ U+028B instead of v) from copy-pasting out of PDFs, rich-text
editors, or web pages with decorative fonts. httpx hard-encodes all
HTTP headers as ASCII, so the non-ASCII char in the Authorization
header triggers the error.
Adds TestApiKeyClientSync with two tests verifying:
- All three key locations are synced after sanitization
- Recovery handles client=None (pre-init) without crashing
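The three-location sync can be illustrated with a stand-in class. `AgentSketch` and `to_ascii` are hypothetical; only the attribute names `api_key`, `_client_kwargs`, and `client.api_key` come from the description above:

```python
def to_ascii(value: str) -> str:
    """Drop every non-ASCII character (e.g. U+028B pasted in place of 'v')."""
    return value.encode("ascii", errors="ignore").decode("ascii")


class AgentSketch:
    def __init__(self, api_key, client=None):
        self.api_key = api_key
        self._client_kwargs = {"api_key": api_key}
        self.client = client  # may be None before initialization

    def sanitize_api_key(self) -> str:
        clean = to_ascii(self.api_key)
        self.api_key = clean
        self._client_kwargs["api_key"] = clean
        # The SDK client keeps its own copy, so it must be updated too.
        if self.client is not None:
            self.client.api_key = clean
        return clean
```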
Three independent fixes:
1. Reset activity timestamp on cached agent reuse (#9051)
When the gateway reuses a cached AIAgent for a new turn, the
_last_activity_ts from the previous turn (possibly hours ago)
carried over. The inactivity timeout handler immediately saw
the agent as idle for hours and killed it.
Fix: reset _last_activity_ts, _last_activity_desc, and
_api_call_count when retrieving an agent from the cache.
2. Detect uv-managed virtual environments (#8620 sub-issue 1)
The systemd unit generator fell back to sys.executable (uv's
standalone Python) when running under 'uv run', because
sys.prefix == sys.base_prefix (uv doesn't set up traditional
venv activation). The generated ExecStart pointed to a Python
binary without site-packages, crashing the service on startup.
Fix: check VIRTUAL_ENV env var before falling back to
sys.executable. uv sets VIRTUAL_ENV even when sys.prefix
doesn't reflect the venv.
3. Nudge model to continue after empty post-tool response (#9400)
Weaker models (GLM-5, mimo-v2-pro) sometimes return empty
responses after tool calls instead of continuing to the next
step. The agent silently abandoned the remaining work with
'(empty)' or used prior-turn fallback text.
Fix: when the model returns empty after tool calls AND there's
no prior-turn content to fall back on, inject a one-time user
nudge message telling the model to process the tool results and
continue. The flag resets after each successful tool round so it
can fire again on later rounds.
Test plan: 97 gateway + CLI tests pass, 9 venv detection tests pass
Previously, non-integer context_length values (e.g. '256K') in
config.yaml were silently ignored, causing the agent to fall back
to 128K auto-detection with no user feedback. This was confusing
for users with custom LiteLLM endpoints expecting larger context.
Now prints a clear stderr warning and logs at WARNING level when
model.context_length or custom_providers[].models.<model>.context_length
cannot be parsed as an integer, telling users to use plain integers
(e.g. 256000 instead of '256K').
Reported by community user ChFarhan via Discord.
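A rough sketch of the parse-and-warn behaviour. `parse_context_length` and the exact warning wording are assumptions for illustration:

```python
import logging
import sys

logger = logging.getLogger(__name__)


def parse_context_length(raw, key="model.context_length"):
    """Return an int, or None (with a warning) when the value is unusable."""
    if raw is None:
        return None
    try:
        return int(raw)
    except (TypeError, ValueError):
        message = (
            f"{key}={raw!r} is not an integer; "
            "use a plain integer such as 256000 instead of '256K'."
        )
        print(f"Warning: {message}", file=sys.stderr)
        logger.warning(message)
        return None
```

Returning `None` lets the caller fall through to auto-detection, but the user now sees why.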
When a user sends a message while the agent is executing a task on the
gateway, the agent is now interrupted immediately — not silently queued.
Previously, messages were stored in _pending_messages with zero feedback
to the user, potentially leaving them waiting 1+ hours.
Root cause: Level 1 guard (base.py) intercepted all messages for active
sessions and returned with no response. Level 2 (gateway/run.py) which
calls agent.interrupt() was never reached.
Fix: Expand _handle_active_session_busy_message to handle the normal
(non-draining) case:
1. Call running_agent.interrupt(text) to abort in-flight tool calls
and exit the agent loop at the next check point
2. Store the message as pending so it becomes the next turn once the
interrupted run returns
3. Send a brief ack: 'Interrupting current task (10 min elapsed,
iteration 21/60, running: terminal). I'll respond shortly.'
4. Debounce acks to once per 30s to avoid spam on rapid messages
Reported by @Lonely__MH.
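Step 4 (debouncing acks) could be implemented along these lines. `AckDebouncer` is a hypothetical helper; the injectable clock is only there to make the sketch testable:

```python
import time
from typing import Callable, Dict


class AckDebouncer:
    """Allow at most one ack per session every `interval` seconds."""

    def __init__(self, interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._interval = interval
        self._clock = clock
        self._last_ack: Dict[str, float] = {}

    def should_ack(self, session_key: str) -> bool:
        now = self._clock()
        last = self._last_ack.get(session_key)
        if last is not None and now - last < self._interval:
            return False  # acked recently, stay quiet
        self._last_ack[session_key] = now
        return True
```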
- find_docker() now checks HERMES_DOCKER_BINARY env var first, then
docker on PATH, then podman on PATH, then macOS known locations
- Entrypoint respects HERMES_HOME env var (was hardcoded to /opt/data)
- Entrypoint uses groupmod -o to tolerate non-unique GIDs (fixes macOS
GID 20 conflict with Debian's dialout group)
- Entrypoint makes chown best-effort so rootless Podman continues
instead of failing with 'Operation not permitted'
- 5 new tests covering env var override, podman fallback, precedence
Based on work by alanjds (PR #3996) and malaiwah (PR #8115).
Closes #4084.
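The lookup precedence can be sketched as below. `find_container_binary` is a hypothetical stand-in for find_docker(); whether the real function validates the override path is not stated, so this sketch returns it as-is, and the macOS locations are left to the caller:

```python
import os
import shutil


def find_container_binary(environ=None, which=shutil.which, known_locations=()):
    """Env override first, then docker, then podman, then known paths."""
    environ = os.environ if environ is None else environ
    override = environ.get("HERMES_DOCKER_BINARY")
    if override:
        return override
    for name in ("docker", "podman"):
        found = which(name)
        if found:
            return found
    for path in known_locations:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None
```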
When compression fails after max attempts, the agent returns
{completed: False, partial: True} but was missing the 'failed' flag.
The gateway's agent_failed_early guard checked for 'failed' AND
'not final_response', but _run_agent_blocking always converts errors
to final_response — making the guard dead code. This caused the
oversized session to persist, creating an infinite fail loop where
every subsequent message hits the same compression failure.
Changes:
- run_agent.py: add 'failed: True' and 'compression_exhausted: True'
to all 5 compression-exhaustion return paths
- gateway/run.py (_run_agent_blocking): forward 'failed' and
'compression_exhausted' flags through to the caller
- gateway/run.py (_handle_message_with_agent): fix agent_failed_early
to check bool(failed) without the broken 'not final_response' clause;
auto-reset the session when compression is exhausted so the next
message starts fresh
- Update tests to match new guard logic and add
TestCompressionExhaustedFlag test class
Closes #9893
The original tree-wide ast.walk() would match registry.register() calls
inside functions too. Restrict to top-level ast.Expr statements so helper
modules that call registry.register() inside a function are never picked
up as tool modules.
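A sketch of the top-level-only check, assuming the call is spelled exactly `registry.register(...)`. `registers_at_top_level` is a hypothetical name:

```python
import ast


def registers_at_top_level(source: str) -> bool:
    """True if the module body itself contains a registry.register(...) call."""
    tree = ast.parse(source)
    for node in tree.body:  # module-level statements only, no ast.walk()
        if not isinstance(node, ast.Expr):
            continue
        call = node.value
        if (
            isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)
            and call.func.attr == "register"
            and isinstance(call.func.value, ast.Name)
            and call.func.value.id == "registry"
        ):
            return True
    return False
```

Because only `tree.body` is inspected, calls nested in functions, classes, or conditionals are ignored.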
The /v1/responses endpoint generated a new UUID session_id for every
request, even when previous_response_id was provided. This caused each
turn of a multi-turn conversation to appear as a separate session on the
web dashboard, despite the conversation history being correctly chained.
Fix: store session_id alongside the response in the ResponseStore, and
reuse it when a subsequent request chains via previous_response_id.
Applies to both the non-streaming /v1/responses path and the streaming
SSE path. The /v1/runs endpoint also gains session continuity from
stored responses (explicit body.session_id still takes priority).
Adds test verifying session_id is preserved across chained requests.
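The reuse logic can be sketched with a toy in-memory store. The method names and `resolve_session_id` are hypothetical; only the idea of storing session_id alongside the response, and the priority of an explicit session_id, come from the description above:

```python
import uuid
from typing import Any, Dict, Optional


class ResponseStore:
    def __init__(self) -> None:
        self._items: Dict[str, Dict[str, Any]] = {}

    def put(self, response_id: str, response: Any, session_id: str) -> None:
        self._items[response_id] = {"response": response, "session_id": session_id}

    def session_for(self, response_id: Optional[str]) -> Optional[str]:
        entry = self._items.get(response_id) if response_id else None
        return entry["session_id"] if entry else None


def resolve_session_id(store: ResponseStore,
                       previous_response_id: Optional[str] = None,
                       explicit: Optional[str] = None) -> str:
    # Explicit id wins, then the chained response's id, then a fresh UUID.
    return explicit or store.session_for(previous_response_id) or str(uuid.uuid4())
```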
* fix: hermes gateway restart waits for service to come back up (#8260)
Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.
Now the SIGUSR1 path blocks with progress feedback:
Phase 1 — wait for old process to die:
⏳ User service draining active work...
Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)
Phase 2 — wait for new process to become active:
⏳ Waiting for hermes-gateway to restart...
Polls systemctl is-active + verifies new PID (up to 60s)
Success:
✓ User service restarted (PID 12345)
Timeout:
⚠ User service did not become active within 60s.
Check status: hermes gateway status
Check logs: journalctl --user -u hermes-gateway --since '2 min ago'
The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.
Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass
* fix: add 402 billing error hint to gateway error handler (#5220)
The gateway's exception handler for agent errors had specific hints for
HTTP 401, 429, 529, 400, 500 — but not 402 (Payment Required / quota
exhausted). Users hitting billing limits from custom proxy providers
got a generic error with no guidance.
Added: 'Your API balance or quota is exhausted. Check your provider
dashboard.'
The underlying billing classification (error_classifier.py) already
correctly handles 402 as FailoverReason.billing with credential
rotation and fallback. The original issue (#5220) where 402 killed
the entire gateway was from an older version — on current main, 402
is excluded from the is_client_error abort path (line 9460) and goes
through the proper retry/fallback/fail flow. Combined with PR #9875
(auto-recover from unexpected SIGTERM), even edge cases where the
gateway dies are now survivable.
Three bugfixes in the agent loop:
1. Reset retry counters after context compression. Without this,
pre-compression retry counts carry over, causing the model to
hit empty-response recovery immediately after a compression-
induced context loss, wasting API calls on a now-valid context.
2. Unmute output in the final-response (no-tool-call) branch.
_mute_post_response could be left True from a prior housekeeping
turn, silently suppressing empty-response warnings and recovery
status that the user should see.
3. Stop injecting 'Calling the X tools...' into assistant message
content when falling back to prior-turn content. This mutated
conversation history with synthetic text that the model never
produced, poisoning subsequent turns.
- gateway start --all: kills all stale gateway processes across all
profiles before starting the current profile's service
- gateway restart --all: stops all gateway processes across all
profiles, then starts the current profile's service fresh
- gateway stop --all: already existed, unchanged
The --all flag was only available on 'stop' but not on 'start' or
'restart', causing 'unrecognized arguments' errors for users.
The streaming path emits output as content-part arrays for Open WebUI
compatibility, but the batch (non-streaming) Responses API path must
return output as a plain string per the OpenAI Responses API spec.
Reverts the _extract_output_items change from the cherry-picked commits
while preserving the streaming path's array format.
API keys containing Unicode lookalike characters (e.g. ʋ U+028B instead
of v) cause UnicodeEncodeError when httpx encodes the Authorization
header as ASCII. This commonly happens when users copy-paste keys from
PDFs, rich-text editors, or web pages with decorative fonts.
Three layers of defense:
1. **Save-time validation** (hermes_cli/config.py):
_check_non_ascii_credential() strips non-ASCII from credential values
when saving to .env, with a clear warning explaining the issue.
2. **Load-time sanitization** (hermes_cli/env_loader.py):
_sanitize_loaded_credentials() strips non-ASCII from credential env
vars (those ending in _API_KEY, _TOKEN, _SECRET, _KEY) after dotenv
loads them, so the rest of the codebase never sees non-ASCII keys.
3. **Runtime recovery** (run_agent.py):
The UnicodeEncodeError recovery block now also sanitizes self.api_key
and self._client_kwargs['api_key'], fixing the gap where message/tool
sanitization succeeded but the API key still caused httpx to fail on
the Authorization header.
Also: hermes_logging.py RotatingFileHandler now explicitly sets
encoding='utf-8' instead of relying on locale default (defensive
hardening for ASCII-locale systems).
PR #9467 added a call to self._fuzzy_file_completions() inside
_context_completions(), but the method was still decorated with
@staticmethod and didn't receive self. Every @ mention in the input
triggers "name 'self' is not defined" from prompt_toolkit's async
completer, spamming the error on every keystroke.
Fix: remove @staticmethod, add self parameter. The method already uses
self._fuzzy_file_completions() and self._get_project_files() via that
call chain, so it was never meant to stay static after the fuzzy search
feature was added.
When a session gets stuck (hung terminal, runaway tool loop) and the
user restarts the gateway, the same session history loads and puts the
agent right back in the stuck state. The user is trapped in a loop:
restart → stuck → restart → stuck.
Fix: track restart-failure counts per session using a simple JSON file
(.restart_failure_counts). On each shutdown with active agents, the
counter increments for those sessions. On startup, if any session has
been active across 3+ consecutive restarts, it's auto-suspended —
giving the user a clean slate on their next message.
The counter resets to 0 when a session completes a turn successfully
(response delivered), so normal sessions that happen to be active
during planned restarts (/restart, hermes update) won't accumulate
false counts.
Implementation:
- _increment_restart_failure_counts(): called during stop() when
agents are active. Writes {session_key: count} to JSON file.
Sessions NOT active are dropped (loop broken).
- _suspend_stuck_loop_sessions(): called on startup. Reads the file,
suspends sessions at threshold (3), clears the file.
- _clear_restart_failure_count(): called after successful response
delivery. Removes the session from the counter file.
No SessionEntry schema changes. No database migration. Pure file-based
tracking that naturally cleans up.
Test plan:
- 9 new stuck-loop tests (increment, accumulate, threshold, clear,
suspend, file cleanup, edge cases)
- All 28 gateway lifecycle tests pass (restart drain + auto-continue
+ stuck loop)
* feat(skills): add fitness-nutrition skill to optional-skills
Cherry-picked from PR #9177 by @haileymarshall.
Adds a fitness and nutrition skill for gym-goers and health-conscious users:
- Exercise search via wger API (690+ exercises, free, no auth)
- Nutrition lookup via USDA FoodData Central (380K+ foods, DEMO_KEY fallback)
- Offline body composition calculators (BMI, TDEE, 1RM, macros, body fat %)
- Pure stdlib Python, no pip dependencies
Changes from original PR:
- Moved from skills/ to optional-skills/health/ (correct location)
- Fixed BMR formula in FORMULAS.md (removed confusing -5+10, now just +5)
- Fixed author attribution to match PR submitter
- Marked USDA_API_KEY as optional (DEMO_KEY works without signup)
Also adds optional env var support to the skill readiness checker:
- New 'optional: true' field in required_environment_variables entries
- Optional vars are preserved in metadata but don't block skill readiness
- Optional vars skip the CLI capture prompt flow
- Skills with only optional missing vars show as 'available' not 'setup_needed'
* fix: increase CLI response text padding to 4-space tab indent
Increases horizontal padding on all response display paths:
- Rich Panel responses (main, background, /btw): padding (1,2) -> (1,4)
- Streaming text: add 4-space indent prefix to each line
- Streaming TTS: add 4-space indent prefix to sentences
Gives response text proper breathing room with a tab-width indent.
Rich Panel word wrapping automatically adjusts for the wider padding.
Requested by AriesTheCoder.
* fix: word-wrap verbose tool call args and results to terminal width
Verbose mode (tool_progress: verbose) printed tool args and results as
single unwrapped lines that could be thousands of characters long.
Adds _wrap_verbose() helper that:
- Pretty-prints JSON args with indent=2 instead of one-line dumps
- Splits text on existing newlines (preserves JSON/structured output)
- Wraps lines exceeding terminal width with 5-char continuation indent
- Uses break_long_words=True for URLs and paths without spaces
Applied to all 4 verbose print sites:
- Concurrent tool call args
- Concurrent tool results
- Sequential tool call args
- Sequential tool results
---------
Co-authored-by: haileymarshall <haileymarshall@users.noreply.github.com>
New users don't know which tool providers to pick during setup.
Add [badge] labels to each provider in the selection menu:
- [★ recommended · free] for best default choices (Edge TTS, Local Browser)
- [★ recommended] for top-tier paid options (Firecrawl Cloud)
- [paid] for options requiring an API key
- [free tier] for services with a free tier (Tavily)
- [free · self-hosted] / [free · local] for self-run options
- [subscription] for Nous subscription-managed options
Also improves vague tag descriptions — e.g. 'AI-native search and
contents' becomes 'Neural search with semantic understanding' and
Tavily gets '1000 free searches/mo'.
Both hermes setup and hermes tools share the same rendering path,
so badges appear in both flows.
Addresses user feedback about setup being confusing for newcomers.
When the gateway restarts mid-agent-work, the session transcript ends
on a tool result the agent never processed. Previously, the user had
to type 'continue' or use /retry (which replays from scratch, losing
all prior work).
Now, when the next user message arrives and the loaded history ends
with role='tool', a system note is prepended:
[System note: Your previous turn was interrupted before you could
process the last tool result(s). Please finish processing those
results and summarize what was accomplished, then address the
user's new message below.]
This is injected in _run_agent()'s run_sync closure, right before
calling agent.run_conversation(). The agent sees the full history
(including the pending tool results) and the system note, so it can
summarize what was accomplished and then handle the user's new input.
Design decisions:
- No new session flags or schema changes — purely detects trailing
tool messages in the loaded history
- Works for any restart scenario (clean, crash, SIGTERM, drain timeout)
as long as the session wasn't suspended (suspended = fresh start)
- The user's actual message is preserved after the note
- If the session WAS suspended (unclean shutdown), the old history is
abandoned and the user starts fresh — no false auto-continue
Also updates the shutdown notification message from 'Use /retry after
restart to continue' to 'Send any message after restart to resume
where it left off' — which is now accurate.
Test plan:
- 6 new auto-continue tests (trailing tool detection, no false
positives for assistant/user/empty history, multi-tool, message
preservation)
- All 13 restart drain tests pass (updated /retry assertion)
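The detection itself is small. A sketch, assuming history is a list of role/content dicts and that the note and the user's message are joined with a blank line (the joining format is an assumption); `with_auto_continue_note` is a hypothetical name:

```python
AUTO_CONTINUE_NOTE = (
    "[System note: Your previous turn was interrupted before you could "
    "process the last tool result(s). Please finish processing those "
    "results and summarize what was accomplished, then address the "
    "user's new message below.]"
)


def with_auto_continue_note(history, user_message: str) -> str:
    """Prepend the note only when the transcript ends on a tool result."""
    if history and history[-1].get("role") == "tool":
        return f"{AUTO_CONTINUE_NOTE}\n\n{user_message}"
    return user_message
```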
Update the Termux guide to mention that the browser tool now
automatically discovers Termux directories, and add the missing
pkg install nodejs-lts step.
Refactor browser tool PATH construction to include Termux directories
(/data/data/com.termux/files/usr/bin, /data/data/com.termux/files/usr/sbin)
so agent-browser and npx are discoverable on Android/Termux.
Extracts _browser_candidate_path_dirs() and _merge_browser_path() helpers
to centralize PATH construction shared between _find_agent_browser() and
_run_browser_command(), replacing duplicated inline logic.
Also fixes os.pathsep usage (was hardcoded ':') for cross-platform correctness.
Cherry-picked from PR #9846.
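A sketch of what a PATH merge helper in this style might do. `merge_path` is hypothetical, and the ordering (existing entries first, extras appended) is an assumption, not necessarily what _merge_browser_path() does:

```python
import os
from typing import Iterable


def merge_path(existing: str, extra_dirs: Iterable[str], sep: str = os.pathsep) -> str:
    """Append extra directories to a PATH string, skipping duplicates."""
    seen = set()
    merged = []
    for entry in [*existing.split(sep), *extra_dirs]:
        if entry and entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return sep.join(merged)
```

Defaulting `sep` to `os.pathsep` is what keeps the helper correct on platforms where the separator is not ':'.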
Adds --from flag to gmail send and gmail reply commands, allowing agents
to customize the From header display name when sharing the same email
account. Usage: --from '"Agent Name" <user@example.com>'
Also syncs repo google_api.py with the deployed standalone implementation
(replaces outdated gws_bridge thin wrapper), adds dedicated docs page
under Features > Skills, and updates sidebar navigation.
Requested by community user @Maxime44.
Add 'xai', 'x-ai', 'x.ai', 'grok' to _PROVIDER_PREFIXES so that
colon-prefixed model names (e.g. xai:grok-4.20) are stripped correctly
for context length lookups.
Cherry-picked from PR #9184 by @Julientalbot.
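For illustration, prefix stripping might look like this. The set below holds only the four entries added here (the real _PROVIDER_PREFIXES contains more), and `strip_provider_prefix` is a hypothetical name:

```python
# Only the entries added by this change; the real set is larger.
_PROVIDER_PREFIXES = frozenset({"xai", "x-ai", "x.ai", "grok"})


def strip_provider_prefix(model: str) -> str:
    """Turn 'xai:grok-4.20' into 'grok-4.20'; leave other names untouched."""
    prefix, sep, rest = model.partition(":")
    if sep and rest and prefix.lower() in _PROVIDER_PREFIXES:
        return rest
    return model
```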
Instead of consuming one top-level slash command slot per skill (hitting the
100-command limit with ~26 built-ins + 74 skills), skills are now organized
under a single /skill group command with category-based subcommand groups:
/skill creative ascii-art [args]
/skill media gif-search [args]
/skill mlops axolotl [args]
Discord supports 25 subcommand groups × 25 subcommands = 625 max skills,
well beyond the previous 74-slot ceiling.
Categories are derived from the skill directory structure:
- skills/creative/ascii-art/ → category 'creative'
- skills/mlops/training/axolotl/ → category 'mlops' (top-level parent)
- skills/dogfood/ → uncategorized (direct subcommand)
Changes:
- hermes_cli/commands.py: add discord_skill_commands_by_category() with
category grouping, hub/disabled filtering, Discord limit enforcement
- gateway/platforms/discord.py: replace top-level skill registration with
_register_skill_group() using app_commands.Group hierarchy
- tests: 7 new tests covering group creation, category grouping,
uncategorized skills, hub exclusion, deep nesting, empty skills,
and handler dispatch
Inspired by Discord community suggestion from bottium.
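The category rule can be sketched with path parts alone. `skill_category` is a hypothetical helper that returns None for uncategorized skills:

```python
from pathlib import PurePosixPath
from typing import Optional


def skill_category(skill_dir: str, root: str = "skills") -> Optional[str]:
    """Top-level parent directory under root, or None if the skill sits directly in it."""
    parts = PurePosixPath(skill_dir).relative_to(root).parts
    return parts[0] if len(parts) > 1 else None
```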
When the gateway receives SIGTERM/SIGINT, the shutdown handler now
runs 'ps aux' and logs every hermes/gateway-related process (excluding
itself). This will show in agent.log as:
WARNING: Shutdown diagnostic — other hermes processes running:
hermes 1234 ... hermes update --gateway
hermes 5678 ... hermes gateway restart
This is the missing diagnostic for #5646 / #6666 — we can prove
the restarts are from systemctl but can't determine WHO issues the
systemctl command. Next time it happens, the agent.log will contain
the evidence (the process that sent the signal or called systemctl
should still be alive when the handler fires).
- Add glm-5v-turbo to OpenRouter, Nous, and native Z.AI model lists
- Add glm-5v context length entry (200K tokens) to model metadata
- Update Z.AI endpoint probe to try multiple candidate models per
endpoint (glm-5.1, glm-5v-turbo, glm-4.7) — fixes detection for
newer coding plan accounts that lack older models
- Add zai to _PROVIDER_VISION_MODELS so auxiliary vision tasks
(vision_analyze, browser screenshots) route through 5v
Fixes #9888
- Add ESC key binding (eager) for secret_state and sudo_state modal
prompts — fires immediately, same behavior as Ctrl+C cancel
- Update placeholder text: 'Enter to submit · ESC to skip' (was
'Enter to skip' which was confusing — Enter on empty looked like
submitting nothing rather than intentionally skipping)
- Update widget body text: 'ESC or Ctrl+C to skip'
- Change feedback message from 'Secret entry cancelled' to 'Secret
entry skipped' — more accurate for the action taken
- getpass fallback prompt also updated for non-TUI mode
Port of Cocoon AI's architecture-diagram-generator (MIT) as a Hermes skill.
Generates professional dark-themed system architecture diagrams as standalone
HTML/SVG files. Self-contained output, no dependencies.
- SKILL.md with design system specs, color palette, layout rules
- HTML template with all component types, arrow styles, legend examples
- Fits alongside excalidraw in creative/ category
Source: https://github.com/Cocoon-AI/architecture-diagram-generator
Add dangerous command patterns that require approval when the agent
tries to run gateway lifecycle commands via the terminal tool:
- hermes gateway stop/restart — kills all running agents mid-work
- hermes update — pulls code and restarts the gateway
- systemctl restart/stop (with optional flags like --user)
These patterns fire the approval prompt so the user must explicitly
approve before the agent can kill its own gateway process. In YOLO
mode, the commands run without approval (by design — YOLO means the
user accepts all risks).
Also fixes the existing systemctl pattern to handle flags between
the command and action (e.g. 'systemctl --user restart' was previously
undetected because the regex expected the action immediately after
'systemctl').
Root cause: issue #6666 reported agents running 'hermes gateway
restart' via terminal, killing the gateway process mid-agent-loop.
The user sees the agent suddenly stop responding with no explanation.
Combined with the SIGTERM auto-recovery from PR #9875, the gateway
now both prevents accidental self-destruction AND recovers if it
happens anyway.
Test plan:
- Updated test_systemctl_restart_not_flagged → test_systemctl_restart_flagged
- All 119 approval tests pass
- E2E verified: hermes gateway restart, hermes update, systemctl
--user restart all detected; hermes gateway status, systemctl
status remain safe
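The systemctl part of the fix comes down to allowing flags between the command and the action. One possible pattern is shown below; it is an illustrative regex, not the one in the approval module:

```python
import re

# Allow any number of dash-prefixed flags (e.g. --user) before the action.
SYSTEMCTL_LIFECYCLE = re.compile(r"\bsystemctl\s+(?:-\S+\s+)*(?:restart|stop)\b")


def needs_approval(command: str) -> bool:
    return SYSTEMCTL_LIFECYCLE.search(command) is not None
```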
- TestHealthDetailedEndpoint: 3 tests for the new API server endpoint
(returns runtime data, handles missing status, no auth required)
- TestProbeGatewayHealth: 5 tests for _probe_gateway_health()
(URL normalization, successful/failed probes, fallback chain)
- TestStatusRemoteGateway: 4 tests for /api/status remote fallback
(remote probe triggers, skipped when local PID found, null PID handling)