The Windows branch of `_terminate_host_pid` early-returned after
`os.kill(pid, SIGTERM)` (which Python maps to `TerminateProcess` for
the target handle only), leaving descendant processes — e.g. Chromium
renderer/GPU/network helpers spawned by an `agent-browser` daemon —
running on Windows even after the preceding commit fixed POSIX.
The right Windows primitive is `taskkill /PID <pid> /T /F`:
`/T` walks the tree, `/F` force-terminates. Same approach
`gateway.status.terminate_pid(force=True)` already uses for the
gateway's own shutdown path; reuse the same shape here.
Why NOT extend the POSIX psutil tree-walk to Windows:
1. Windows doesn't maintain a Unix-style process tree. `psutil.
Process.children(recursive=True)` walks PPID links that go stale
when intermediate processes exit, so enumeration is best-effort
and silently misses orphaned descendants. The whole bug we're
fixing is orphaned descendants.
2. `psutil.Process.terminate()` on Windows is `TerminateProcess()`
for one handle — same single-PID scope as the existing
`os.kill`. The existing comment in `gateway/status.py::
terminate_pid` warns this explicitly: 'os.kill SIGTERM is not
equivalent to a tree-killing hard stop' on Windows.
3. Headless Chromium has no GUI window, so the softer
`taskkill /T` without `/F` (which sends WM_CLOSE) won't reach
it either. `/F` is required.
POSIX path is unchanged. The taskkill subprocess uses the same
`creationflags=windows_hide_flags()` pattern other Windows shellouts
in this codebase use. `FileNotFoundError` / `TimeoutExpired` /
`OSError` fall back to bare `os.kill(SIGTERM)` as cheap insurance.
Tests cover the Windows branch via the codebase's standard
`monkeypatch _IS_WINDOWS` pattern (`references/windows-native-
support.md`), plus POSIX tree-walk order, NoSuchProcess swallow,
and the OSError fallback path. 7 new tests, all green on Linux CI.
Sweep of all CI failures on origin/main, grouped by drift source:
Telegram allowlist gate (db50af910 added user-authz to _should_process_message):
- Hardcoded "[Telegram]" prefix in the logger.warning so the call no
longer dereferences self.name → self.platform, which test fixtures
built via object.__new__ never set.
- test_telegram_format / test_allowed_channels_widening fixtures stub
_is_callback_user_authorized → True so the new gate doesn't reject
guest-mode / allowed-channels test messages.
- test_telegram_approval_buttons::test_update_prompt_callback_not_affected
sets TELEGRAM_ALLOWED_USERS="*" so the fail-closed default doesn't
reject the callback before it writes .update_response.
Approval surface (6d495d9e7 renamed status, 214b95392 detached stdin):
- test_no_callback_returns_approval_required: status is now
"pending_approval" (was "approval_required").
- test_close_stdin_allows_eof_driven_process_to_finish: switch to
use_pty=True; non-PTY now uses stdin=DEVNULL.
Mattermost (send() now resolves root_id via _api_get first):
- test_send_with_thread_reply mocks _session.get with a thread-root
response so the new resolver doesn't TypeError on a bare AsyncMock.
Kanban (d8ad431de rename, f55d94a1e review column, _kanban_worker_skill_available):
- _safe_int → _to_epoch in the two test_kanban_db tests.
- Spawn-skills tests (×3) monkey-patch _kanban_worker_skill_available
to True since the isolated kanban_home fixture has no devops/kanban-worker tree.
- test_gateway_dispatcher_disables_corrupt_board: connect count
3 → 5 (review-column probe now also runs per tick).
Aux-config severity at_or_above (a94ddd807):
- test_diagnostics_endpoint_severity_filter expects warning filter to
include error+critical now (was exact-match).
Anthropic error handling (conversation loop extracted from run_agent):
- _no_backoff_wait fixture patches BOTH run_agent.jittered_backoff AND
agent.conversation_loop.jittered_backoff. The latter is the actual
call site; without the second patch tests burn ~2s per retry and
hit the 30s SIGALRM timeout on CI.
Other test pollution / drift:
- test_auto_does_not_select_copilot_from_github_token: patch
agent.bedrock_adapter.has_aws_credentials → False so boto3's
credential chain can't auto-pick Bedrock from developer ~/.aws.
- test_setup_openclaw_migration: patch hermes_cli.gateway.get_env_value
in addition to setup_mod.get_env_value — _platform_status reads
through the gateway module's binding.
- test_gateway_prefix: COMPONENT_PREFIXES["gateway"] now includes
"hermes_plugins" too.
- test_recommended_update_command_defaults_to_hermes_update: also
short-circuit get_managed_update_command in case a stray
~/.hermes/.managed marker is present.
- test_user_id_is_not_explicit: _parse_target_ref now returns
is_explicit=False for Slack U.../W... IDs (chat.postMessage rejects
them — a DM must be opened first via conversations.open).
After Popen succeeds with os.setsid (detached process group), 5 things
happen with no try/except: Thread construction, reader.start(), lock
acquisition, prune+register, checkpoint write. If any raises, the
Popen object goes unregistered and the detached process group leaks
indefinitely.
Wrap the post-spawn setup in try/except. On failure:
- os.killpg(getpgid(pid), SIGKILL) takes down the entire process
group (not just the shell - important because of detached PG +
-lic shell wrapper that may have spawned children)
- proc.kill() fallback for ProcessLookupError/PermissionError/OSError
- proc.wait(timeout=5) reaps with a bound
- re-raise to preserve original traceback
Nested try/except around cleanup so a secondary failure can't mask the
original.
Closes#2749.
PR #21561 migrated liveness probes across 14 call sites from
`os.kill(pid, 0)` to `gateway.status._pid_exists` (psutil-first) so
the gateway doesn't Ctrl+C-itself on Windows via bpo-14484. A handful of
tests still patched the old `os.kill` seam and either happened to pass
on POSIX (when PID 12345 incidentally wasn't alive on the CI worker) or
failed outright — on CI runs they surfaced as 7 flaky/stable failures.
Migrate each affected test to patch the correct seam:
- tests/tools/test_browser_orphan_reaper.py (5 tests)
Patch `gateway.status._pid_exists` instead of `os.kill`.
Rename test_permission_error_on_kill_check_skips to
test_alive_legacy_daemon_is_reaped — the old assertion was
"PermissionError on sig 0 → skip dir"; post-migration the
untracked-alive-daemon path always reaps the dir after SIGTERM
(best-effort semantics were preserved).
- tests/tools/test_windows_native_support.py (4 tests)
Replace tests that asserted `os.kill` seam behavior with tests
that exercise `ProcessRegistry._is_host_pid_alive` as a
delegator and split out a new TestPidExistsOSErrorWidening class
that hits `gateway.status._pid_exists` directly via the POSIX
fallback branch (so Windows-style `OSError(WinError 87)` + `PermissionError`
widening is still covered on Linux CI).
- tests/tools/test_process_registry.py (1 test)
Mock `psutil.Process` + `_pid_exists` instead of `os.kill`
for the detached-session kill path.
- tests/tools/test_mcp_stability.py::test_kill_orphaned_uses_sigkill_when_available
SIGTERM → alive-check → SIGKILL flow now uses `_pid_exists`
for the middle step; assertion count drops from 3 to 2.
- tests/gateway/test_status.py::TestScopedLocks (2 tests)
`acquire_scoped_lock` consults `_pid_exists`; patch that
seam directly instead of trying to control the nested psutil
call via os.kill monkeypatch.
- tests/hermes_cli/test_gateway.py::test_stop_profile_gateway_keeps_pid_file_when_process_still_running
The stop loop sends one SIGTERM via os.kill then polls 20x via
_pid_exists; instrument both separately. Old assertion
`calls["kill"] == 21` split into `kill == 1` + `alive_probes == 20`.
- tests/hermes_cli/test_auth_toctou_file_modes.py::test_shared_nous_store_writes_0o600_with_0o700_parent
Commit c34884ea2 switched the pytest seat-belt guard in
`_nous_shared_store_path()` from `Path.home() / ".hermes"`
to `get_default_hermes_root()`, which honors HERMES_HOME. The
test sets both HERMES_HOME and HERMES_SHARED_AUTH_DIR to
subpaths of the same tmp_path, and the override now collapses
onto the same path the guard is refusing. Renamed the override
subdirectory so the two paths diverge — guard passes, test runs.
All 21 original CI failures and their local-flaky siblings now pass
(278 tests across the touched files, 0 failures).
When a background terminal process spawns a descendant daemon that
inherits the stdout pipe (e.g. 'hermes update' triggering a gateway
systemctl restart), the reader thread's stdout.read() never returns EOF
and its finally: block never runs. session.exited stays False forever,
so process(action='poll') returns 'running' indefinitely even though
the direct child exited long ago.
Issue #17327: Feishu user polled 74 times over 7 minutes before killing
the gateway manually.
Fix: add _reconcile_local_exit() that checks the direct Popen.poll()
before trusting session.exited. If the direct child has exited, drain
any immediately-readable bytes non-blocking and flip session.exited.
Called from poll() and wait(). The stuck reader thread remains blocked
but is a daemon thread and gets reaped with the process.
Safe no-op for env/PTY sessions, already-exited sessions, and live
children (returns None from Popen.poll()).
Background process watchers (notify_on_complete, check_interval) created
synthetic SessionSource objects without user_id/user_name. While the
internal=True bypass (1d8d4f28) prevented false pairing for agent-
generated notifications, the missing identity caused:
- Garbage entries in pairing rate limiters (discord:None, telegram:None)
- 'User None' in approval messages and logs
- No user identity available for future code paths that need it
Additionally, platform messages arriving without from_user (Telegram
service messages, channel forwards, anonymous admin actions) could still
trigger false pairing because they are not internal events.
Fix:
1. Propagate user_id/user_name through the full watcher chain:
session_context.py → gateway/run.py → terminal_tool.py →
process_registry.py (including checkpoint persistence/recovery)
2. Add None user_id guard in _handle_message() — silently drop
non-internal messages with no user identity instead of triggering
the pairing flow.
Salvaged from PRs #7664 (kagura-agent, ContextVar approach),
#6540 (MestreY0d4-Uninter, tests), and #7709 (guang384, None guard).
Closes#6341, #6485, #7643
Relates to #6516, #7392
Previously crash recovery recreated detached sessions as if they were
fully managed, so polls and kills could lie about liveness and the
checkpoint could forget recovered jobs after the next restart.
This commit refreshes recovered host-backed sessions from real PID
state, keeps checkpoint data durable, and preserves notify watcher
metadata while treating sandbox-only PIDs as non-recoverable.
- Persist `pid_scope` in `tools/process_registry.py` and skip
recovering sandbox-backed entries without a host-visible PID handle
- Refresh detached sessions on access so `get`/`poll`/`wait` and active
session queries observe exited processes instead of hanging forever
- Allow recovered host PIDs to be terminated honestly and requeue
`notify_on_complete` watchers during checkpoint recovery
- Add regression tests for durable checkpoints, detached exit/kill
behavior, sandbox skip logic, and recovered notify watchers
Salvaged from PR #1573 by @eren-karakus0. Cherry-picked with authorship preserved.
Fixes#1143 — background process notifications resume after gateway restart.
Co-authored-by: Muhammet Eren Karakuş <erenkar950@gmail.com>
Extend subprocess env sanitization beyond provider credentials by blocking Hermes-managed tool, messaging, and related gateway runtime vars. Reuse a shared sanitizer in LocalEnvironment and ProcessRegistry so background and PTY processes honor the same blocklist and _HERMES_FORCE_ escape hatch. Add regression coverage for local env execution and process_registry spawning.
Cover model_tools, toolset_distributions, context_compressor,
prompt_caching, cronjob_tools, session_search, process_registry,
and cron/scheduler with 127 new test cases.