hermes-agent/gateway/platforms
Teknium 1cbe399149 fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper
On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's
implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0
as ``CTRL_C_EVENT`` because the two integer values collide at the C
layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` —
which sends a Ctrl+C to the ENTIRE console process group containing
the target PID, not just the PID itself. Any caller that wanted to
check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)``
idiom was silently killing that process (and often unrelated
processes in the same console group) on Windows. Long-standing
Python Windows quirk; see bpo-14484 (open since 2012).

This manifested in Hermes as: every ``hermes gateway status``
invocation would read the gateway's PID from the PID file, call
``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a
"liveness check", and instantly terminate the gateway it was trying
to report on. No shutdown log, no traceback, no atexit hook fire,
no exit-diag entry — just silent termination of the detached pythonw
process. "Bot answered one message then stopped typing" was the
characteristic end-user symptom because `os.kill(pid, 0)` fires
mid-response-send and kills the gateway between logs.

Reproduction (verified in this branch before the fix):

  $ hermes gateway start       # gateway alive, PID 37520
  $ hermes gateway status      # reports "No gateway process detected"
  $ tasklist /FI "PID eq 37520"  # INFO: No tasks are running
                                 # — gateway terminated silently

Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper:

- On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION |
  SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)``
  via ctypes. Zero signal delivery, zero console-group side effects.
  Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs
  on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER
  (PID gone) from ERROR_ACCESS_DENIED (alive but another user).
- On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is
  a no-op there.

Then patch every ``os.kill(pid, 0)`` liveness-check callsite to
route through ``_pid_exists`` instead. Total 14 callsites across
11 files; every single one was a latent silent-kill on Windows:

  gateway/run.py:2810      — /restart watcher (inline subprocess)
  gateway/run.py:15195     — --replace wait loop
  gateway/status.py:572    — acquire_gateway_runtime_lock stale check
  gateway/status.py:828    — get_running_pid (THE killer for status)
  gateway/platforms/whatsapp.py:111
  hermes_cli/gateway.py:228, 522, 1012  — gateway-related drain loops
  hermes_cli/kanban_db.py:2826         — _pid_alive was claiming to
                                         be cross-platform but used
                                         os.kill(pid, 0) on Windows
  hermes_cli/main.py:5792        — CLI process-kill polling
  hermes_cli/profiles.py:782     — profile stop wait loop
  plugins/google_meet/process_manager.py:74
  tools/browser_tool.py:1215, 1255  — browser daemon ownership probes
  tools/mcp_tool.py:1255, 3374     — MCP stdio orphan tracking

The watcher source in gateway/run.py:2810 is a multi-line string
that gets spawned as an inline ``python -c "..."`` subprocess, so
it can't import gateway.status. The fix for that callsite inlines
the same ctypes probe directly into the watcher source.

Tested on Windows 10 with the hermes gateway + Telegram bot:
- gateway start → alive
- 5 consecutive ``hermes gateway status`` invocations → gateway
  alive after every one, same PID reported each time (37520, 21952)
- gateway.log shows uninterrupted operation; no spurious shutdown
  entries; cron ticker and kanban dispatcher still running on
  their 60-second cadence
- bot continues answering Telegram messages throughout

Ships alongside an exit-path diagnostic wrapper in
``hermes_cli/gateway.py::run_gateway()`` that captures every way
``asyncio.run(start_gateway(...))`` can return (success, SystemExit,
KeyboardInterrupt, BaseException, atexit) with full traceback to
``logs/gateway-exit-diag.log``. This was used to prove the gateway
was being hard-killed externally (no exit event fired) and should
be kept for future Windows debugging.

Refs: https://bugs.python.org/issue14484
See also: references/windows-subprocess-sigint-storm.md in
the hermes-agent skill.
2026-05-08 12:34:27 -07:00
..
qqbot feat(qqbot): wire native tool-approval UX via inline keyboards 2026-05-07 07:48:15 -07:00
__init__.py yuanbao platform (#16298) 2026-04-26 18:50:49 -07:00
_http_client_limits.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
ADDING_A_PLATFORM.md docs(platforms): document env_enablement_fn + cron_deliver_env_var hooks (#21331) 2026-05-07 07:36:42 -07:00
api_server.py fix(gateway): log agent task failures instead of silently losing usage data 2026-05-07 06:25:03 -07:00
base.py fix(gateway): consolidate runtime-status writes + rate-limit failure logs 2026-05-07 06:30:26 -07:00
bluebubbles.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
dingtalk.py feat(gateway): add allowed_{chats,channels,rooms} whitelist to Telegram, Mattermost, Matrix, DingTalk 2026-05-07 06:54:29 -07:00
discord.py fix(discord): route DM role-auth opt-in through config.yaml (not env var) 2026-05-07 05:51:56 -07:00
email.py fix(email): drop non-allowlisted senders before dispatch to prevent mail loops 2026-05-04 12:35:22 -07:00
feishu.py fix(gateway): use monotonic deadlines in QR onboarding flows 2026-05-07 05:09:39 -07:00
feishu_comment.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
feishu_comment_rules.py fix(feishu-comment): use get_hermes_home(); drop dead asyncio wrapper; AUTHOR_MAP 2026-04-17 19:04:11 -07:00
helpers.py fix(gateway): ensure deterministic thread eviction in helpers 2026-05-05 10:13:55 -07:00
homeassistant.py fix(gateway): correct ws scheme conversion for https urls 2026-05-03 03:54:03 -07:00
matrix.py feat(gateway): add allowed_{chats,channels,rooms} whitelist to Telegram, Mattermost, Matrix, DingTalk 2026-05-07 06:54:29 -07:00
mattermost.py feat(gateway): add allowed_{chats,channels,rooms} whitelist to Telegram, Mattermost, Matrix, DingTalk 2026-05-07 06:54:29 -07:00
signal.py fix(signal): skip reactions for unauthorized senders 2026-05-04 01:38:21 -07:00
signal_rate_limit.py feat(gateway/signal): add support for multiple images sending 2026-04-30 04:28:08 -07:00
slack.py feat(slack): add allowed_channels whitelist config 2026-05-07 06:54:29 -07:00
sms.py test(sms): use clear=True in test_missing_phone_number_is_non_retryable 2026-05-04 05:25:09 -07:00
telegram.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-07 19:24:45 -07:00
telegram_network.py fix(gateway): keep DoH-confirmed Telegram IPs that match system DNS (#14520) 2026-05-05 04:42:59 -07:00
webhook.py fix(webhook): widen INSECURE_NO_AUTH loopback check + tests + docs 2026-05-07 07:38:43 -07:00
wecom.py fix(gateway): use monotonic deadlines in QR onboarding flows 2026-05-07 05:09:39 -07:00
wecom_callback.py fix(gateway): tighten httpx keepalive and close whatsapp typing-response leak (#18451) 2026-05-02 02:23:37 -07:00
wecom_crypto.py feat(gateway): add WeCom callback-mode adapter for self-built apps 2026-04-11 15:22:49 -07:00
weixin.py fix(weixin): wrap long copy-unfriendly lines 2026-05-07 06:08:06 -07:00
whatsapp.py fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper 2026-05-08 12:34:27 -07:00
yuanbao.py fix(yuanbao): enforce owner identity check on group slash commands 2026-04-30 23:57:55 -07:00
yuanbao_media.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
yuanbao_proto.py chore: remove unused imports and dead locals (ruff F401, F841) (#17010) 2026-04-28 06:46:45 -07:00
yuanbao_sticker.py yuanbao platform (#16298) 2026-04-26 18:50:49 -07:00