mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-09 03:11:58 +00:00
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).
This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook fire,
no exit-diag entry — just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom because `os.kill(pid, 0)` fires
mid-response-send and kills the gateway between logs.
Reproduction (verified in this branch before the fix):
$ hermes gateway start # gateway alive, PID 37520
$ hermes gateway status # reports "No gateway process detected"
$ tasklist /FI "PID eq 37520" # INFO: No tasks are running
# — gateway terminated silently
Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:
- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
via ctypes. Zero signal delivery, zero console-group side effects.
Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
(PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is
a no-op there.
Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. Total 14 callsites across
11 files; every single one was a latent silent-kill on Windows:
gateway/run.py:2810 — /restart watcher (inline subprocess)
gateway/run.py:15195 — --replace wait loop
gateway/status.py:572 — acquire_gateway_runtime_lock stale check
gateway/status.py:828 — get_running_pid (THE killer for status)
gateway/platforms/whatsapp.py:111
hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops
hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to
be cross-platform but used
os.kill(pid, 0) on Windows
hermes_cli/main.py:5792 — CLI process-kill polling
hermes_cli/profiles.py:782 — profile stop wait loop
plugins/google_meet/process_manager.py:74
tools/browser_tool.py:1215, 1255 — browser daemon ownership probes
tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking
The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.
Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
entries; cron ticker and kanban dispatcher still running on
their 60-second cadence
- bot continues answering Telegram messages throughout
Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.
Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.
|
||
|---|---|---|
| .. | ||
| browser_providers | ||
| environments | ||
| neutts_samples | ||
| web_providers | ||
| __init__.py | ||
| ansi_strip.py | ||
| approval.py | ||
| binary_extensions.py | ||
| browser_camofox.py | ||
| browser_camofox_state.py | ||
| browser_cdp_tool.py | ||
| browser_dialog_tool.py | ||
| browser_supervisor.py | ||
| browser_tool.py | ||
| budget_config.py | ||
| checkpoint_manager.py | ||
| clarify_tool.py | ||
| code_execution_tool.py | ||
| credential_files.py | ||
| cronjob_tools.py | ||
| debug_helpers.py | ||
| delegate_tool.py | ||
| discord_tool.py | ||
| env_passthrough.py | ||
| feishu_doc_tool.py | ||
| feishu_drive_tool.py | ||
| file_operations.py | ||
| file_state.py | ||
| file_tools.py | ||
| fuzzy_match.py | ||
| homeassistant_tool.py | ||
| image_generation_tool.py | ||
| interrupt.py | ||
| kanban_tools.py | ||
| managed_tool_gateway.py | ||
| mcp_oauth.py | ||
| mcp_oauth_manager.py | ||
| mcp_tool.py | ||
| memory_tool.py | ||
| mixture_of_agents_tool.py | ||
| neutts_synth.py | ||
| openrouter_client.py | ||
| osv_check.py | ||
| patch_parser.py | ||
| path_security.py | ||
| process_registry.py | ||
| registry.py | ||
| rl_training_tool.py | ||
| schema_sanitizer.py | ||
| send_message_tool.py | ||
| session_search_tool.py | ||
| skill_manager_tool.py | ||
| skill_provenance.py | ||
| skill_usage.py | ||
| skills_guard.py | ||
| skills_hub.py | ||
| skills_sync.py | ||
| skills_tool.py | ||
| slash_confirm.py | ||
| terminal_tool.py | ||
| tirith_security.py | ||
| todo_tool.py | ||
| tool_backend_helpers.py | ||
| tool_output_limits.py | ||
| tool_result_storage.py | ||
| transcription_tools.py | ||
| tts_tool.py | ||
| url_safety.py | ||
| vision_tools.py | ||
| voice_mode.py | ||
| web_tools.py | ||
| website_policy.py | ||
| xai_http.py | ||
| yuanbao_tools.py | ||