mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-12 03:42:08 +00:00
fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).
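
To make the collision concrete, here is a minimal Python model of the dispatch described above. It is a sketch, not CPython's code: the function name ``windows_kill_route`` is hypothetical, and the two constants mirror the Windows values of ``signal.CTRL_C_EVENT`` and ``signal.CTRL_BREAK_EVENT``.

```python
# Values of signal.CTRL_C_EVENT / signal.CTRL_BREAK_EVENT on Windows.
CTRL_C_EVENT = 0
CTRL_BREAK_EVENT = 1


def windows_kill_route(sig: int) -> str:
    """Name the Win32 call that os.kill(pid, sig) ends up making on Windows.

    Hypothetical helper for illustration only; it models the branching,
    it does not call any Win32 API.
    """
    if sig in (CTRL_C_EVENT, CTRL_BREAK_EVENT):
        # A console control event is generated for the process group.
        return "GenerateConsoleCtrlEvent"
    # Any other value terminates the process, using sig as the exit code.
    return "TerminateProcess"
```

Under this model no value of ``sig`` acts as a pure probe: 0 becomes a Ctrl+C and every other value terminates the target.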
This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook firing,
no exit-diag entry: just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom, because the ``os.kill(pid, 0)`` call
could land mid-response and kill the gateway between log entries.
Reproduction (verified in this branch before the fix):
    $ hermes gateway start          # gateway alive, PID 37520
    $ hermes gateway status         # reports "No gateway process detected"
    $ tasklist /FI "PID eq 37520"   # INFO: No tasks are running
                                    # (gateway terminated silently)
Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:
- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
  SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
  via ctypes. Zero signal delivery, zero console-group side effects.
  Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
  on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
  (PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom, which actually is
  a no-op there.
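
The two branches above could look roughly like the following. This is a sketch under assumptions, not the code shipped in ``gateway/status.py``: the name ``pid_exists`` and the internal layout are illustrative, and only the Win32 calls and constants named in this message are used.

```python
import ctypes
import os
import sys

PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
SYNCHRONIZE = 0x00100000
WAIT_TIMEOUT = 0x00000102
ERROR_ACCESS_DENIED = 5


def _pid_exists_windows(pid: int) -> bool:
    """Handle-based probe: opens the process, never signals it."""
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # Pin argument and return types so HANDLE/DWORD values are not
    # truncated or read as signed ints.
    kernel32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
    kernel32.OpenProcess.restype = wintypes.HANDLE
    kernel32.WaitForSingleObject.argtypes = [wintypes.HANDLE, wintypes.DWORD]
    kernel32.WaitForSingleObject.restype = wintypes.DWORD
    kernel32.CloseHandle.argtypes = [wintypes.HANDLE]
    kernel32.CloseHandle.restype = wintypes.BOOL

    handle = kernel32.OpenProcess(
        PROCESS_QUERY_LIMITED_INFORMATION | SYNCHRONIZE, False, pid
    )
    if not handle:
        # Access denied: the process exists but belongs to another user.
        # Anything else (e.g. ERROR_INVALID_PARAMETER): treat as gone.
        return ctypes.get_last_error() == ERROR_ACCESS_DENIED
    try:
        # A zero-timeout wait times out while the process is still running.
        return kernel32.WaitForSingleObject(handle, 0) == WAIT_TIMEOUT
    finally:
        kernel32.CloseHandle(handle)


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID appears to be alive."""
    if pid <= 0:
        return False  # 0 and negatives address process groups on POSIX
    if sys.platform == "win32":
        return _pid_exists_windows(pid)
    try:
        os.kill(pid, 0)  # a true no-op probe on POSIX
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # alive, owned by another user
    return True
```

For example, ``pid_exists(os.getpid())`` should report the current process as alive on either platform.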
Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. The list below covers 16
callsites across 10 files; every one was a latent silent kill on Windows:
gateway/run.py:2810 — /restart watcher (inline subprocess)
gateway/run.py:15195 — --replace wait loop
gateway/status.py:572 — acquire_gateway_runtime_lock stale check
gateway/status.py:828 — get_running_pid (THE killer for status)
gateway/platforms/whatsapp.py:111
hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops
hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to
be cross-platform but used
os.kill(pid, 0) on Windows
hermes_cli/main.py:5792 — CLI process-kill polling
hermes_cli/profiles.py:782 — profile stop wait loop
plugins/google_meet/process_manager.py:74
tools/browser_tool.py:1215, 1255 — browser daemon ownership probes
tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking
The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.
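
A rough illustration of what inlining the probe into a ``python -c`` source string can look like. Everything here is a sketch: ``WATCHER_SRC`` and ``probe_in_subprocess`` are hypothetical names, and the real watcher does more than print a status word.

```python
import os
import subprocess
import sys

# Self-contained source: it cannot import project modules, so the probe
# is written out in full inside the string.
WATCHER_SRC = r'''
import os, sys

def pid_exists(pid):
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes
        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.OpenProcess.argtypes = [wintypes.DWORD, wintypes.BOOL, wintypes.DWORD]
        k32.OpenProcess.restype = wintypes.HANDLE
        k32.WaitForSingleObject.argtypes = [wintypes.HANDLE, wintypes.DWORD]
        k32.WaitForSingleObject.restype = wintypes.DWORD
        k32.CloseHandle.argtypes = [wintypes.HANDLE]
        # PROCESS_QUERY_LIMITED_INFORMATION | SYNCHRONIZE
        h = k32.OpenProcess(0x1000 | 0x00100000, False, pid)
        if not h:
            return ctypes.get_last_error() == 5  # ERROR_ACCESS_DENIED
        try:
            return k32.WaitForSingleObject(h, 0) == 0x102  # WAIT_TIMEOUT
        finally:
            k32.CloseHandle(h)
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True

print("alive" if pid_exists(int(sys.argv[1])) else "dead")
'''


def probe_in_subprocess(pid: int) -> str:
    """Run the inline watcher source against pid and return its verdict."""
    result = subprocess.run(
        [sys.executable, "-c", WATCHER_SRC, str(pid)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the PID as an argument keeps the source string constant, so it needs no string formatting or escaping.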
Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
entries; cron ticker and kanban dispatcher still running on
their 60-second cadence
- bot continues answering Telegram messages throughout
Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.
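
One possible shape for such a wrapper, as a sketch only: ``run_with_exit_diag``, ``main_factory`` and the log line format are assumptions for illustration, not the contents of ``run_gateway()``.

```python
import asyncio
import atexit
import datetime
import traceback
from pathlib import Path


def run_with_exit_diag(main_factory, log_path: Path):
    """Run ``main_factory()`` under asyncio.run and log how it ended."""

    def record(event: str) -> None:
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        try:
            with log_path.open("a", encoding="utf-8") as fh:
                fh.write(f"{stamp} {event}\n")
        except OSError:
            pass  # diagnostics must never take the process down

    # Runs on any orderly interpreter shutdown; absent after a hard kill.
    atexit.register(record, "atexit fired")
    try:
        result = asyncio.run(main_factory())
    except SystemExit as exc:
        record(f"SystemExit code={exc.code!r}\n{traceback.format_exc()}")
        raise
    except KeyboardInterrupt:
        record(f"KeyboardInterrupt\n{traceback.format_exc()}")
        raise
    except BaseException:
        record(f"unhandled exception\n{traceback.format_exc()}")
        raise
    record(f"returned normally: {result!r}")
    return result
```

If the log ends with none of these entries, not even the atexit line, the process was most likely killed from outside, which is the signature described above.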
Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.
parent ac178b78c4
commit 1cbe399149
11 changed files with 296 additions and 151 deletions

@@ -1211,19 +1211,10 @@ def _reap_orphaned_browser_sessions():
         if os.path.isfile(owner_pid_file):
             try:
                 owner_pid = int(Path(owner_pid_file).read_text(encoding="utf-8").strip())
-                try:
-                    os.kill(owner_pid, 0)
-                    owner_alive = True
-                except ProcessLookupError:
-                    owner_alive = False
-                except PermissionError:
-                    # Owner exists but we can't signal it (different uid).
-                    # Treat as alive — don't reap someone else's session.
-                    owner_alive = True
-                except OSError:
-                    # Windows: gone PID raises OSError (WinError 87) instead
-                    # of ProcessLookupError. Treat as dead to match POSIX.
-                    owner_alive = False
+                # ``os.kill(pid, 0)`` is NOT a no-op on Windows (bpo-14484).
+                # Use the cross-platform existence check.
+                from gateway.status import _pid_exists
+                owner_alive = _pid_exists(owner_pid)
             except (ValueError, OSError):
                 owner_alive = None  # corrupt file — fall through

@@ -1250,19 +1241,10 @@ def _reap_orphaned_browser_sessions():
             shutil.rmtree(socket_dir, ignore_errors=True)
             continue

-        # Check if the daemon is still alive
-        try:
-            os.kill(daemon_pid, 0)  # signal 0 = existence check
-        except ProcessLookupError:
-            # Already dead, just clean up the dir
-            shutil.rmtree(socket_dir, ignore_errors=True)
-            continue
-        except PermissionError:
-            # Alive but owned by someone else — leave it alone
-            continue
-        except OSError:
-            # Windows raises OSError (WinError 87) for a gone PID — treat
-            # as dead and clean up, mirroring the ProcessLookupError branch.
+        # Check if the daemon is still alive. ``os.kill(pid, 0)`` on Windows
+        # is NOT a no-op — use the handle-based existence check.
+        from gateway.status import _pid_exists
+        if not _pid_exists(daemon_pid):
             shutil.rmtree(socket_dir, ignore_errors=True)
             continue
