PR #29119 dropped the 'not streamed_message' guard unconditionally so
that plugin-transformed responses (transform_llm_output hook) would
reach ACP clients. That regressed test_prompt_does_not_duplicate_streamed_final_message:
when no transform happened, the streamed text was re-sent as a duplicate
final delivery.
Tighten the condition to mirror the gateway side: deliver after streaming
only when response_transformed=True. Otherwise keep the old guard.
Adds test_prompt_delivers_transformed_response_after_streaming so the
transformed path stays covered.
Adds a test that fails without the gateway fix, exercising the
response_transformed=True branch in _finalize_response: a streamed
response whose final text was modified by a transform_llm_output
plugin hook must be edit_message'd in place (not duplicate-sent),
with already_sent=True so the normal final-send is skipped.
Also drops two minor leftovers from the salvaged PR #29119:
* accumulated_text property on GatewayStreamConsumer (unused)
* duplicate _response_transformed=False inside the hook try block
Closes#31370.
bws defaults to the US identity endpoint, so EU Cloud and self-hosted
machine-account tokens fail with [400 Bad Request] {"error":"invalid_client"}
during 'hermes secrets bitwarden setup'. The token is valid — it's just
being checked against the wrong region.
Add a Bitwarden region step to the wizard between the access-token and
project-list steps:
Step 1 Install bws
Step 2 Provide access token
Step 3 Pick region <-- new (US / EU / self-hosted-custom-URL)
Step 4 Pick project (now talks to the right endpoint)
Step 5 Test fetch
Region is stored in config.yaml as secrets.bitwarden.server_url and
plumbed into every bws subprocess as BWS_SERVER_URL (project list,
secret list, test fetch, and the env_loader startup pull).
Also:
- Non-interactive: 'hermes secrets bitwarden setup --server-url ...'
- Pre-existing BWS_SERVER_URL in the shell is detected and reused
- Cache key includes server_url so EU/US fetches don't collide
- 'hermes secrets bitwarden status' shows the configured region
- 'invalid_client' / '400 Bad Request' from bws now triggers a hint
pointing at the region setting instead of looking like a bad token
PR #6a1aa420e coupled `display.tool_progress: verbose` (a per-tool display
toggle for full args / results / think blocks) to `self.verbose` — which
controls root-logger DEBUG level. Result: setting tool_progress: verbose
in config silently flipped every module in the process to DEBUG and
flooded the terminal with internal logging, far beyond just full tool
calls.
The two concepts are separate:
- `tool_progress_mode == 'verbose'` → display behavior (tool rendering)
- `self.verbose` → logging behavior (root logger → DEBUG, line 9795)
This change keeps PR #6a1aa420e's argparse.SUPPRESS / config-fallback
plumbing but severs the verbose-display → debug-logging link.
Changes:
- cli.py:2868 — `self.verbose` only follows explicit `verbose=` arg; no
longer auto-True when tool_progress_mode == 'verbose'.
- cli.py:_toggle_verbose — slash-cycle through tool progress modes no
longer flips `self.verbose` / `agent.verbose_logging` / `agent.quiet_mode`.
- cli.py:9355 — fix misleading label (drop 'and debug logs').
- tui_gateway/server.py:_make_agent — same decoupling on the TUI side
(verbose_logging no longer derived from tool_progress_mode).
- tests/cli/test_tool_progress_scrollback.py — invert the test that
asserted the broken coupling; add coverage for explicit `--verbose`
still enabling DEBUG independent of tool_progress.
Live verified:
- tool_progress: verbose, no --verbose flag → 0 DEBUG/INFO log lines
- --verbose flag explicit → 32 DEBUG/INFO log lines (as expected)
When asyncio.sleep() fires just before Task.cancel() is called, CPython
sets _must_cancel=True but cannot cancel the already-completed sleep
future, so CancelledError is delivered at the next await (handle_message)
rather than at the sleep. By that point the superseded task has already
popped the merged event from _pending_text_batches, so the superseding
task sees an empty batch and silently drops the message.
Fix: add a synchronous task-registry check between the sleep and the pop.
No await between the check and the pop means no other coroutine can
interleave, so the guard is race-free.
When WeCom returns errcode=40001 (invalid credential) or 42001 (token
expired), send() was returning a failure without evicting the bad token
from _access_tokens. All subsequent sends then kept using the same
invalid cached token until its TTL naturally expired (~7200s).
Fix: on the first token-rejection errcode, evict the cache entry and
retry once with a freshly fetched token. Non-token errcodes fail
immediately as before. If the refreshed token also fails, the error
is returned without looping further.
Adds four regression tests covering: successful retry on 40001,
successful retry on 42001, no retry on unrelated errcode, and clean
failure when the refresh does not help.
* fix(profiles): cross-profile soft guard on file-write tools + system-prompt hint
Adds a soft guard so an agent running under one Hermes profile cannot
silently edit a different profile's skills/plugins/cron/memories.
Three layers:
A. agent/file_safety.classify_cross_profile_target
Classifies a write target against the active HERMES_HOME. Returns
a {active_profile, target_profile, area, target_path} dict when the
path lands in another profile's scoped area. PROFILE_SCOPED_AREAS =
(skills, plugins, cron, memories). get_cross_profile_warning()
wraps it into a model-facing error string that names both profiles,
names the area, and points at the cross_profile=True bypass.
Defense-in-depth, NOT a security boundary — the terminal tool runs
as the same OS user and can write any of these paths directly. The
guard exists to prevent confused-agent corruption, not to stop a
determined attacker. SECURITY.md §3.2 (terminal-bypass posture)
still applies.
Wired into tools/file_tools.write_file_tool and patch_tool with a
cross_profile=False kwarg. WRITE_FILE_SCHEMA and PATCH_SCHEMA both
advertise cross_profile so the model can pass it after explicit
user direction. patch_tool extracts target paths from V4A patch
bodies before checking (same shape as the existing sensitive-path
check).
skill_manage is already scoped to the active profile's SKILLS_DIR
by construction, so no extra guard wiring is needed there. The
D-side error message (below) still names other profiles when the
skill exists elsewhere.
B. agent/system_prompt
One deterministic line near the environment-hints block names the
active profile and tells the model not to modify another profile's
skills/plugins/cron/memories without explicit direction. Profile
name is stable for the lifetime of the AIAgent, so the line is
prompt-cache-safe.
D. tools/skill_manager_tool._skill_not_found_error
Replaces the bare "Skill 'X' not found." with a message that:
- names the active profile,
- searches OTHER profiles' skills dirs for the same name,
- names the profile(s) where the skill exists and the path,
- suggests `hermes -p <name>` to switch profiles, or
cross_profile=True for an explicit edit.
All 5 "not found" sites in skill_manager_tool (edit, patch, delete,
write_file, remove_file) now go through the helper.
Reference incident (May 2026): a hermes-security profile session
edited skills under both ~/.hermes/profiles/hermes-security/skills/
AND ~/.hermes/skills/ (the default profile's skills) without
realizing the second path belonged to a different profile. Three of
the four skill files needed manual restoration afterward.
What this PR does NOT do:
* No hard block. The terminal tool can still touch any of these
paths with no guard — same posture as the dangerous-command
approval flow. SECURITY.md §3.2 applies.
* No regex sweep on terminal commands for cross-profile paths.
That direction is a Skills-Guard-style arms race (cd + relative
paths, base64, etc.) and would false-positive on legitimate
cross-profile reads. Filed as a follow-up.
* No on-disk path migration. ~/.hermes/skills/ remains the
default profile's skills dir; this PR is about telling the
agent about that boundary, not changing the layout.
Tests:
tests/agent/test_file_safety_cross_profile.py (16 tests)
- _resolve_active_profile_name covers default/named/failure paths
- classify_cross_profile_target covers all four scoped areas,
both directions (default → named, named → default, named → named),
non-Hermes paths, and root-level config files
- get_cross_profile_warning covers in-profile no-op, cross-profile
message shape, and the defense-in-depth self-documentation
tests/tools/test_cross_profile_guard.py (12 tests)
- write_file: in-profile allow, cross-profile block, cross_profile=True
bypass, non-Hermes pass-through
- patch: replace-mode block, cross_profile=True bypass, V4A patch
path extraction
- skill_manage: error names the other profile (single + multiple),
missing-everywhere falls back to skills_list hint
- system prompt: contract-level checks (both branches present,
cross_profile=True mentioned, ~/.hermes/profiles/ referenced)
All 207 existing tests in file_safety/file_operations/skill_manager
still pass. 10 system-prompt tests still pass.
E2E verified: the exact incident scenario (security profile editing
default's hermes-agent-dev skill) is now blocked with the warning
message; cross_profile=True unblocks.
* fix(code_execution): add cross_profile to write_file/patch stubs
The cross_profile kwarg added to write_file_tool/patch_tool needs to
flow through the execute_code sandbox stubs in _TOOL_STUBS so the
test_stubs_cover_all_schema_params drift test passes. Without this,
scripts running inside execute_code couldn't pass cross_profile=True
through hermes_tools.write_file().
Caught by CI on PR #31290.
Adds an --ids flag to 'hermes kanban promote' mirroring the existing
block/schedule convention, so the marquee use case from issue #28822
(promote all children of a closed organizational parent in one shot)
doesn't require a shell loop. Single-id JSON output stays a flat
object for back-compat; bulk emits a list. Dedupes positional + --ids
so the same id can't be promoted twice in one call. 5 new CLI-level
tests cover bulk happy path, partial-failure exit code, JSON shapes,
and dedup.
Also adds the thedavidmurray noreply-email -> github-login mapping in
scripts/release.py so the salvage cherry-pick passes the AUTHOR_MAP
contributor-credit check.
Adds `hermes kanban promote <task_id>` for manual lifecycle recovery
when an auto-promote daemon misses the parent-done transition (issue
#28822). Refuses promotion unless every parent dep is done/archived
(override with --force). Emits a `promoted_manual` audit event distinct
from the automatic `promoted` kind, so audit consumers can filter
human-driven from system-driven promotions. Supports --dry-run and
--json for orchestration. Does not mutate assignee/claim state — the
dispatcher picks the card up via its normal ready polling path.
Closes#28822.
Two independent bugs caused the slash-command autocomplete to render
`/goal` as `/goa` (and `/gquota` as `/gquot` for that matter) in the TUI:
1. `tui_gateway/server.py` was forwarding `c.display` from
prompt_toolkit's `Completion` straight into the JSON-RPC payload.
prompt_toolkit normalizes `display=` into `FormattedText` (a `list`
subclass), so the wire format became `[["", "/goal"]]` instead of
the `string` that `CompletionItem.display` in the TUI declares.
`meta` already went through `to_plain_text` — `display` did not.
2. The dropdown row in `appOverlays.tsx` used `flexDirection="row"`
with the display `<Text>` and the (very long) meta `<Text>` as
siblings. When the meta overflows the row width, Ink/Yoga shrinks
the *first* column by one cell, lopping the trailing character off
the command name. `/goal` triggers it reliably because its meta
string is the longest of any built-in command (description +
embedded `[text | pause | resume | clear | status]` usage hint).
Wrapping the display column in `<Box flexShrink={0}>` keeps it at
its natural width and lets the meta wrap or truncate instead.
The original PR #17194 description claimed test_display_tool_preview.py
but only ever shipped test_display_todo_progress.py. Add the missing
coverage for the failure-suffix path:
- _trim_error: whitespace strip, length cap, File-not-found path collapse
- _detect_tool_failure: terminal exit codes, memory full, structured
{error}/{message} extraction, malformed JSON, None result
- get_cute_tool_message E2E: read_file failure, terminal exit-only,
terminal stderr message, memory full, success path, no-result path
Also update test_tool_progress_scrollback.test_error_suffix_on_failed_tool
to reflect the new behavior: the generic '[error]' fallback in cli.py
has been removed; failure suffixes now come from the result-aware
_detect_tool_failure (e.g. '[exit 1]', '[File not found: x]').
Parse the todo_tool result summary to display completion progress in
CLI tool preview lines:
Read: ┊ 📋 plan 3/4 task(s) 0.5s
Update: ┊ 📋 plan update 3/4 ✓ 0.5s
Create: falls back to plain count when no completed tasks
Falls back gracefully to the existing 'N task(s)' format when the
result is missing, malformed, or has no completed items.
Originally proposed in PR #17194 by Albert.Zhou; salvaged onto current
main.
Co-authored-by: Albert.Zhou <albert748@gmail.com>
`terminal(background=true)` without `notify_on_complete=true` or
`watch_patterns` runs the process SILENTLY — the agent has no way
to learn it finished short of calling `process(action='poll')`
explicitly. That's correct for genuine long-lived processes (servers,
watchers, daemons) but is a footgun for every bounded task (tests,
builds, deploys, CI pollers, batch jobs), which is the vast majority
of background uses.
Hit on May 23, 2026 (PR #31231 incident): agent launched a CI-watch
loop with `background=true` only. The poller ran fine, exited green
6 minutes later, agent never noticed. User had to surface 'we are
green CI, you can merge.' Memory and skill docs said *what* to do
(poll in background) but not *how* to receive the result. The
`notify_on_complete=true` flag exists and works, but is easy to
forget when bg seems sufficient on its own.
Two changes here, mutually reinforcing:
1. Runtime nudge: tool result for `background=true` w/o notify or
watch_patterns now includes a `hint` field explaining the silent-
process failure mode and pointing at the corrective flag. Agent
sees it on the same turn and self-corrects without needing the
user to surface anything. Cost for legitimate server cases is one
ignored read (~50 tokens); cost for forgot-notify cases is
prevented blindness (potentially many turns, or a user nudge).
False positives << false negatives.
2. Schema/description rewrite: top-level TERMINAL_TOOL_DESCRIPTION
and the `background` field description now lead with 'Almost
always pair with notify_on_complete=true' instead of presenting
it as one of two equally-likely patterns. The two legitimate
non-notify shapes (long-lived servers; watch_patterns mid-process
signals) are still documented, but as the minority case.
Tests cover all four shapes: bg-only emits hint, bg+notify doesn't,
bg+watch_patterns doesn't, foreground doesn't. 4 new tests; full
suite of background/process tests stays green (160/160 across the
relevant 6 test files).
AI Card "tool progress" cards created with finalize=False were left in
streaming state on DingTalk's UI after a gateway restart because
disconnect() called _streaming_cards.clear() without first closing
them via _close_streaming_siblings.
Move the finalization loop before self._http_client.aclose() so the
HTTP client is still available when the finalize requests are sent.
Adds a regression test that asserts the HTTP client is alive during
finalization.
The Windows branch of `_terminate_host_pid` early-returned after
`os.kill(pid, SIGTERM)` (which Python maps to `TerminateProcess` for
the target handle only), leaving descendant processes — e.g. Chromium
renderer/GPU/network helpers spawned by an `agent-browser` daemon —
running on Windows even after the preceding commit fixed POSIX.
The right Windows primitive is `taskkill /PID <pid> /T /F`:
`/T` walks the tree, `/F` force-terminates. Same approach
`gateway.status.terminate_pid(force=True)` already uses for the
gateway's own shutdown path; reuse the same shape here.
Why NOT extend the POSIX psutil tree-walk to Windows:
1. Windows doesn't maintain a Unix-style process tree. `psutil.
Process.children(recursive=True)` walks PPID links that go stale
when intermediate processes exit, so enumeration is best-effort
and silently misses orphaned descendants. The whole bug we're
fixing is orphaned descendants.
2. `psutil.Process.terminate()` on Windows is `TerminateProcess()`
for one handle — same single-PID scope as the existing
`os.kill`. The existing comment in `gateway/status.py::
terminate_pid` warns this explicitly: 'os.kill SIGTERM is not
equivalent to a tree-killing hard stop' on Windows.
3. Headless Chromium has no GUI window, so the softer
`taskkill /T` without `/F` (which sends WM_CLOSE) won't reach
it either. `/F` is required.
POSIX path is unchanged. The taskkill subprocess uses the same
`creationflags=windows_hide_flags()` pattern other Windows shellouts
in this codebase use. `FileNotFoundError` / `TimeoutExpired` /
`OSError` fall back to bare `os.kill(SIGTERM)` as cheap insurance.
Tests cover the Windows branch via the codebase's standard
`monkeypatch _IS_WINDOWS` pattern (`references/windows-native-
support.md`), plus POSIX tree-walk order, NoSuchProcess swallow,
and the OSError fallback path. 7 new tests, all green on Linux CI.
os.kill(pid, SIGTERM) only signals the parent, leaving Chromium child
processes (renderer, GPU, etc.) orphaned. Reuse the existing
ProcessRegistry._terminate_host_pid() helper which walks the process
tree leaf-up via psutil, terminating children before the parent.
Policy: if it ain't a secret it goes in config.yaml. HERMES_INFERENCE_PROVIDER
was leaking behavioral config into the .env surface, including from the gateway,
which bypassed config.yaml entirely.
Behavior:
- gateway/run.py: drop HERMES_INFERENCE_PROVIDER read in _resolve_runtime_agent_kwargs.
Gateway now flows through resolve_runtime_provider() with no `requested` override,
which reads model.provider from config.yaml first.
Docs/UX (strip env var from user-facing surface):
- --provider help text no longer mentions the env var
- cli-config.yaml.example same
- reference/environment-variables.md: remove HERMES_INFERENCE_PROVIDER row and
the cross-reference from HERMES_INFERENCE_MODEL
- reference/cli-commands.md: blank the env-var column for --provider
- guides/xai-grok-oauth.md, guides/minimax-oauth.md: replace
HERMES_INFERENCE_PROVIDER=x hermes invocations with config.yaml / --provider
- developer-guide/adding-providers.md, model-provider-plugin.md: reframe
Internal mechanism (kept as-is):
- hermes_cli/main.py writes HERMES_INFERENCE_PROVIDER into the TUI subprocess env
- tui_gateway/server.py reads it on TUI startup
- resolve_requested_provider() / oneshot.py / cli.py still fall through to the
env var as a last-resort behind config.yaml, which is what makes the TUI
parent->child handoff work
This stays. We just stop documenting it as a user knob.
Tests: tests/gateway/test_auth_fallback.py — simplify mock to fail on first
call, succeed on second; drop monkeypatch.setenv lines that no longer matter.
Supersedes #31064 (closed with credit to @novax635 who surfaced the underlying
issue but proposed aligning gateway *to* the env var rather than removing it).
Auxiliary LLM tasks (vision, compression, web_extract, etc.) currently
require modifications to core files for any plugin that needs its own
task slot — specifically the _AUX_TASKS list in hermes_cli/main.py and
the hardcoded env-var bridging dict in gateway/run.py. This violates
the 'plugins must not modify core files' rule and forces every memory
or context plugin that wants its own auxiliary task to either fork
core or open a coupled core+plugin PR.
This change adds a generic plugin surface for auxiliary task
registration:
ctx.register_auxiliary_task(
key='memory_retain_filter',
display_name='Memory retain filter',
description='hindsight pre-retain dedup/extract',
defaults={'timeout': 30, 'extra_body': {'reasoning_effort': 'low'}},
)
After registration, the task automatically:
- Appears in 'hermes model → Configure auxiliary models' picker via
a new _all_aux_tasks() merge of built-in + plugin tasks
- Has its provider/model/base_url/api_key bridged from config.yaml
to AUXILIARY_<KEY_UPPER>_* env vars at gateway startup
(gateway/run.py now uses a dynamic bridged-keys set instead of
a hardcoded per-task dict)
- Gets plugin-declared defaults (timeout, extra_body, etc.) layered
underneath user config so unconfigured plugin tasks still work
(agent/auxiliary_client._get_auxiliary_task_config)
- Resets to auto via 'Reset all to auto' alongside built-ins
Validation:
- Rejects shadowing of built-in keys (vision, compression, etc.)
- Rejects invalid key shapes (must match [A-Za-z0-9_]+)
- Rejects cross-plugin collisions (clear error)
- Allows same-plugin re-registration (idempotent updates)
Plugin discovery failures (rare) fall back gracefully — the aux
config UI still shows built-in tasks if get_plugin_auxiliary_tasks()
raises, and gateway env-var bridging keeps working for built-ins.
Built-in tasks remain hardcoded in _AUX_TASKS for stability — they're
the baseline UX, and DEFAULT_CONFIG already ships their defaults.
Plugin tasks layer on top.
Tests: 15 new tests in test_plugin_auxiliary_tasks.py covering API
validation, manager state lifecycle, helper sort order, _all_aux_tasks
merge semantics, _reset_aux_to_auto inclusion of plugin tasks, and
default-layering in auxiliary_client.
Updates the gateway-bridge code-parity test (test_auxiliary_config_bridge)
to assert the new dynamic shape rather than the hardcoded literal env
var names which no longer appear post-refactor.
Motivation: this unblocks PR #20262 (hindsight smart retain pipeline)
and similar plugins that need a dedicated aux task slot. The change
is non-breaking — built-in env vars (AUXILIARY_VISION_PROVIDER, etc.)
keep working since they're produced by the same f-string template
that built the hardcoded names.
Trim ~600 LOC off the original contribution while keeping the same
operator-facing surface and detection coverage.
- Collapse three entry points (file / dir / bundle) into one
ast_scan_path(path) that handles both files and directories.
- Drop AstFinding dataclass + severity field — replaced with plain
(file, line, pattern_id, description) tuples. Severity ordering was
display-only for a diagnostic that explicitly disclaims security
verdicts, so the field added bookkeeping without earning its place.
- Replace Rich-markup formatter with plain text grouped by file.
- Drop the 'inspect --ast-deep' surface — same scanner, same output as
'audit --deep', single CLI entry is enough. Operators audit after
install; pre-install inspection signal isn't worth the second surface.
- Trim test file to the cases that earn their place: bypass payload,
syntax error survival, RecursionError survival, false-positive guard
(importer lookalike), literal-arg false-positive guard, non-.py
ignored, directory recursion + cache-dir skipping, missing-path,
getattr/__dict__ detection, formatter empty + populated.
Net: tools/skills_ast_audit.py 353 -> 133 LOC,
tests/tools/test_skills_ast_audit.py 299 -> 103 LOC, full diff
+704/-12 -> +264/-6. No change to tools/skills_guard.py — Skills Guard
verdicts remain untouched per SECURITY.md §2.4.
Add opt-in AST diagnostics for skill review without making Skills Guard stricter by default.
- Add hermes skills inspect --ast-deep to scan fetched skill bundles before installation
- Add hermes skills audit --deep to scan already-installed hub skills
- Keep AST analysis in tools/skills_ast_audit.py, separate from tools/skills_guard.py
- Label output as diagnostic hints, not security verdicts
- Cover dynamic import/access patterns: importlib, __import__(computed), getattr(computed), and __dict__[computed]
This follows the maintainer guidance from closed PR #7436: useful AST-level analysis belongs in an opt-in diagnostic path, not in Skills Guard's default heuristic scan.
Null bytes in API key values (introduced by copy-paste) crash
os.environ[k] = v with ValueError: embedded null byte, preventing
hermes from starting at all.
Robustness:
- Surface 401/404 stream failures via _set_fatal_error() so the gateway's
runtime status reflects 'fatal: ntfy_unauthorized' / 'ntfy_topic_not_found'
instead of staying 'connected' when the reconnect loop halts. Matches
the pattern in whatsapp / telegram / sms adapters.
- Strip whitespace from auth tokens so pasted tokens with trailing
newlines don't produce malformed Authorization headers.
Simplicity:
- Extract _build_auth_header() and _truncate_body() to module-level
helpers, used by both NtfyAdapter and _standalone_send. Removes the
duplicated auth/truncation logic between the two paths.
Docs:
- website/docs/user-guide/messaging/ntfy.md — full setup guide,
identity-model warning, self-hosting, cron usage, troubleshooting.
- website/docs/reference/environment-variables.md — all 9 NTFY_* vars.
- website/docs/user-guide/messaging/index.md — platform comparison row.
- website/sidebars.ts — sidebar entry between simplex and open-webui.
Tests: 78/78 (+ 10 new robustness tests covering token hygiene, fatal
error propagation for 401/404, and the _truncate_body helper).
ntfy now ships as a self-contained plugin under plugins/platforms/ntfy/
instead of editing 8 core files (gateway/config.py Platform enum,
gateway/run.py factory + auth maps, cron/scheduler.py, toolsets.py,
hermes_cli/status.py, agent/prompt_builder.py, gateway/channel_directory.py,
tools/send_message_tool.py).
All routing goes through gateway/platform_registry via register_platform():
- adapter_factory, check_fn, validate_config, is_connected
- env_enablement_fn seeds PlatformConfig.extra from NTFY_* env vars so
gateway status reflects env-only setups without instantiating httpx
- standalone_sender_fn handles deliver=ntfy cron jobs when cron runs
out-of-process from the gateway
- allowed_users_env / allow_all_env hook into _is_user_authorized
- cron_deliver_env_var=NTFY_HOME_CHANNEL for cron home routing
- platform_hint surfaces in the system prompt
- pii_safe=True (topic names are the only identifier; no PII to redact)
Tests moved to tests/gateway/test_ntfy_plugin.py using _plugin_adapter_loader
so the module lives under plugin_adapter_ntfy in sys.modules and cannot
collide with sibling plugin-adapter tests on the same xdist worker. The
core-file grep tests (Platform.NTFY in source, hermes-ntfy in toolsets,
etc.) are replaced with plugin-shape tests covering register() metadata,
env_enablement_fn output, and standalone_sender_fn behavior.
68 tests pass under scripts/run_tests.sh.
First scratch workspace creation on an install now emits a one-shot
warning log + a 'tip_scratch_workspace' event on the task. Sentinel
file at ~/.hermes/kanban/.scratch_tip_shown silences subsequent
creations across the whole install.
Behavior unchanged — scratch is still ephemeral by design. This just
makes the design visible to new users (reported in user community:
'progress files vanished, no warning anywhere').
Docs (en + ko) updated to spell out 'Deleted when the task completes'
on the scratch bullet and 'Preserved on completion' on worktree/dir.
User incident (Slack, 2026-05-13): user walked away mid-conversation,
agent requested approval to run `rm -rf .git`, the prompt timed out
after the gateway_timeout (default 300s), and the agent removed the
.git folder on its own. Corroborated by an independent report from a
Telegram user.
The underlying code path was correct — `check_all_command_guards`
returns `approved=False` with a BLOCKED message on both timeout and
explicit deny, and `terminal_tool` surfaces that as `status=blocked`
to the agent. The bug is at the model-interface layer: the message
"BLOCKED: Command timed out. Do NOT retry this command." reads to
some models as "try a different command achieving the same outcome."
This commit changes only the model-facing message + the structured
return shape:
- Timeout message now explicitly names the three evasion paths the
agent must avoid: retry, rephrase, AND achieve the same outcome
via a different command. Ends with "Silence is not consent."
- Explicit deny gets the same shape minus the silence-is-not-consent
line (it WAS an explicit deny, not silence).
- New structured fields on the return dict: `outcome` ("timeout"
or "denied") and `user_consent` (always False on this branch)
so plugins, hooks, and audit pipelines don't have to string-parse
the message to distinguish the two cases.
The mechanism that should already have prevented the original incident
— timeout treated as deny, BLOCKED result, post hook fires with
`choice="timeout"` — is unchanged. This commit hardens only the
agent's reading of the result.
Tests:
- test_timeout_returns_approved_false_with_no_consent — pins the
return shape on the Slack-shaped notify_cb-registered path
- test_timeout_message_is_emphatic_against_retry_and_rephrase —
pins the exact phrases the message must contain
- test_explicit_deny_carries_same_no_consent_shape — same contract
on explicit /deny
- test_timeout_emits_post_hook_with_timeout_outcome — pins the
post_approval_response hook payload so audit plugins can act
329 approval tests passing (4 new + 325 existing).
Fixes#24912
Reproduction (production, 2026-05-14): two concurrent sessions on the
same agent. Session A patches MEMORY.md directly via the patch tool,
appending ~8KB of structured content (Vendor Master, Standing Orders,
Pin Board) — none of it through the memory tool, so no § delimiters.
Session B starts later with stale in-memory state (1 entry, ~331
chars). Session B calls memory(action=replace) on its one known
entry. The tool's _read_file parses A's content as a single 8KB
'entry' (no § splits), then replace truncates that entry to B's new
333-byte content. ~8KB of structured content silently destroyed.
The atomic-rename write path is fine in isolation. The bug is the
implicit contract: the tool assumes MEMORY.md is exclusively a
§-delimited list of small entries it wrote, but the v0.13 install
runbook itself uses 'cat >> MEMORY.md' for onboarding, the patch tool
edits the file directly, and operators do too.
Fix: a drift guard in MemoryStore._detect_external_drift that fires
on either signal:
1. Re-parse + re-serialize doesn't produce identical bytes
(catches oddly-encoded delimiters / partial writes).
2. Any single parsed entry exceeds the store's whole-file char
limit. The tool budgets the ENTIRE store against that limit
(2200 chars for memory, 1375 for user), so no tool-written
entry can legitimately be larger. An entry bigger than the
store limit means an external writer dropped free-form content
into what the tool will treat as one entry.
When drift fires, _reload_target writes a .bak.<ts> snapshot of the
on-disk file, then add/replace/remove refuse to flush. The original
file stays untouched. The error dict surfaces the .bak path AND a
remediation string ('integrate missing entries via memory(add=...)
one at a time, then rewrite the file clean') so the model can act on
it without escalating to the operator.
Tests:
- test_replace_refuses_on_drift, test_add_refuses_on_drift,
test_remove_refuses_on_drift — all three mutators refuse
- test_clean_file_does_not_trigger_drift — false-positive check
- test_error_message_points_at_remediation — error string shape
- test_drift_guard_also_protects_user_target — USER.md too
- test_drift_backup_filename_is_unique_per_invocation — bak.<ts>
naming pin
144 memory tests passing (was 137; +7).
Fixes#26045
Eleven new tests pinning the #29344 fix. Layout mirrors the existing
"Fix D" entitlement section so the bad-credentials disambiguator
sits alongside the entitlement-block tests it complements.
Classifier-level coverage:
* ``test_is_entitlement_failure_false_for_bad_credentials_wke_suffix``
— verbatim shape from the reporter's wire capture
(``{code: 'caller does not have permission', error: 'OAuth2 access
token could not be validated. [WKE=unauthenticated:bad-credentials]'}``)
↦ classifier must return False so the refresh path runs.
* ``test_is_entitlement_failure_false_for_wke_suffix_in_normalized_shape``
— same body after ``_extract_api_error_context`` has rewritten it
to ``{reason, message}``. The disambiguator must fire in BOTH
shapes; without this guard the production call site at
``_recover_with_credential_pool`` (which goes through the
normalised extractor) would still misclassify.
* ``test_is_entitlement_failure_false_for_any_wke_unauthenticated_variant``
— parametrised forward-compat: ``bad-credentials``,
``expired-token``, ``revoked``, ``some-future-reason``. xAI
documents the prefix as stable, the suffix after the colon as a
reason code that can grow; every variant under
``unauthenticated:`` must route to refresh.
* ``test_is_entitlement_failure_false_via_oauth2_validation_phrase_alone``
— belt-and-braces guard: if a future API revision drops the WKE
suffix but keeps "OAuth2 access token could not be validated", we
still classify correctly.
* ``test_is_entitlement_failure_wke_signal_overrides_entitlement_keywords``
— defensive: if a body ever carries BOTH the WKE suffix and
entitlement language, the WKE signal wins. Auth is recoverable;
entitlement isn't, and a refreshed token will resurface the
entitlement message on the next request.
* ``test_is_entitlement_failure_case_insensitive_wke_match`` —
pins that the classifier lowercases the haystack so a future xAI
build that uppercases the prefix doesn't reintroduce the bug.
Recovery-path coverage (end-to-end through
``_recover_with_credential_pool``):
* ``test_recover_with_credential_pool_refreshes_on_xai_bad_credentials_403``
— the headline test the reporter requested: a bad-credentials 403
with the exact wire body must call ``try_refresh_current()``
exactly once and ``_swap_credential`` once. Pre-fix this returned
``(False, _)`` because the entitlement classifier over-matched and
short-circuited the refresh path.
* ``test_recover_with_credential_pool_still_blocks_real_entitlement``
— companion regression guard for #26847: a pure unsubscribed-
account body (no WKE suffix, no OAuth2-validation phrase) must
still surface as entitlement and skip refresh. The new
disambiguator must not weaken the original loop-protection it
was added to preserve.
The scaffolding reuses ``_make_codex_agent``, ``_FakePool``, and the
existing ``MagicMock`` patterns from the surrounding tests so the
new section reads as a natural extension of "Fix D" rather than a
separate test file.
Closes#30045. Based on @qike-ms's PR #30141.
Telegram status callbacks (lifecycle, compression, context-pressure)
used to append a fresh bubble on every emit. Now adapter tracks
{(chat_id, status_key) -> message_id}; first call sends, subsequent
calls edit. Failed edits drop the cache entry and fall through to a
fresh send.
- gateway/platforms/telegram.py: send_or_update_status() (+34 LOC)
- gateway/run.py: route _status_callback_sync through it when the
adapter supports it; plain adapter.send() otherwise (+15 LOC)
- 5 tests covering first send / edit-in-place / edit-failure fallback
/ distinct key & chat isolation
PR 2362cc468 ("fix(gateway): enforce env variable template expansion
on runtime config loaders") refactored `_load_service_tier` to read
config via the new `_load_gateway_runtime_config` wrapper instead of
opening `_hermes_home/config.yaml` directly. The
`test_run_agent_passes_priority_processing_to_gateway_agent` test still
only stubbed `_load_gateway_config` (the inner loader), so the runtime
wrapper saw an empty config and `_load_service_tier` returned None,
breaking the test:
FAILED tests/gateway/test_fast_command.py::test_run_agent_passes_priority_processing_to_gateway_agent
- AssertionError: assert None == 'priority'
Fix: also stub `_load_gateway_runtime_config` to return the expected
`agent.service_tier=fast` config, so the test once again drives the
priority routing path it was written to verify.
Confirmed reproducing on current main before the patch and passing
after.
Follow-up to @sprmn24's verdict-logic fix. The previous block-message
ended in 'Use --force to override' regardless of verdict — but as of
the --force fix above, dangerous community/trusted skills can't be
overridden by --force at all. The misleading hint sends users in a
loop. Replace it with a specific message that tells them what the
documented behavior actually is.
Adds two regression tests covering the dangerous-verdict message
shape and one that pins the existing --force hint for non-dangerous
blocks.
`_deliver_kanban_artifacts` routes candidates through
`BasePlatformAdapter.filter_local_delivery_paths` (added in 41d2c758c),
which rejects paths outside `MEDIA_DELIVERY_SAFE_ROOTS`. The two
artifact-delivery tests create fixtures under `tmp_path`, which lives
outside the cache roots — so under CI's hermetic HOME the filter
silently dropped both fake files and the assertions on
`images_uploaded` / `documents_uploaded` failed.
Fix: monkeypatch `HERMES_MEDIA_ALLOW_DIRS=str(tmp_path)` in both tests
so the safety filter accepts the fixtures. Production behaviour
unchanged; test-side fix only.
CI fail repro on origin/main: test (6) shard, both
test_notifier_uploads_artifacts_on_completion and
test_notifier_artifact_delivery_skips_missing_files.