hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-18 04:41:56 +00:00

Author	SHA1	Message	Date
Daniel Marta	1fb9f7c68c	fix(gateway): pass max_total_size_mb and max_file_size_mb to CheckpointManager The /rollback command handler in gateway/run.py was constructing CheckpointManager with only enabled and max_snapshots, omitting max_total_size_mb and max_file_size_mb that the __init__ expects. This caused a TypeError on every /rollback invocation when checkpoints were enabled. Fixes: NousResearch/hermes-agent#18841	2026-05-09 17:54:44 -07:00
Wesley Simplicio	246c676c2b	fix(gateway): degrade gracefully when all platform adapters are missing When connected_count == 0 AND enabled_platform_count > 0, the gateway treated 'all adapters returned None' identically to 'all adapters failed to connect' — both as fatal startup errors. The 'returned None' case happens when imports fail silently or when adapters are present in config but their dependencies aren't installed (e.g. discord.py missing). Cron jobs and other gateway-runtime work would unnecessarily fail to start. Split: only return False when startup_retryable_errors is non-empty (real connection attempt failed). When the list is empty AND enabled > 0, log a warning and continue running, matching the 'no platforms enabled' cron path. Salvage of #22642's gateway slice. Drops the bundled run_agent.py memory-nudge counter hydration block (issue #22357 territory) which wasn't mentioned in the PR description. Closes #5196.	2026-05-09 17:53:46 -07:00
Wesley Simplicio	116a1446a4	fix(terminal): bridge docker_env config to TERMINAL_DOCKER_ENV Problem: terminal.docker_env set in config.yaml was silently ignored. Docker containers never received the user-specified env vars. Root cause: docker_env was missing from all three config→env bridging maps (cli.py env_mappings, gateway/run.py _terminal_env_map, hermes_cli/config.py _config_to_env_sync) and from the terminal_tool _get_env_config() reader. _create_environment() consumed the key from container_config correctly, but it was always {} because TERMINAL_DOCKER_ENV was never set. Also extend the list-serialisation branches in cli.py and gateway/run.py to handle dict values via json.dumps (lists already used json.dumps; plain str() on a dict produces undecodable output). Fix: - cli.py: add "docker_env": "TERMINAL_DOCKER_ENV" to env_mappings; serialise dict values with json.dumps alongside existing list path - gateway/run.py: same additions to _terminal_env_map and serialisation - hermes_cli/config.py: add "terminal.docker_env": "TERMINAL_DOCKER_ENV" to _config_to_env_sync so `hermes config set terminal.docker_env …` persists to .env correctly - tools/terminal_tool.py: add docker_env key to _get_env_config() reading TERMINAL_DOCKER_ENV via _parse_env_var with default "{}" Tests: add test_docker_env_is_bridged_everywhere to tests/tools/test_terminal_config_env_sync.py — stash-verified: fails on origin/main, passes with fix. Fixes #20537	2026-05-09 17:53:35 -07:00
Kid	1321bcf5fe	fix(gateway): finalize final stream edit on done	2026-05-09 17:51:04 -07:00
Teknium	70bfd429e5	fix(gateway): preserve reasoning_content, codex_message_items, finish_reason on transcript replay (#22839 ) PR #2974 whitelisted three reasoning fields (reasoning, reasoning_details, codex_reasoning_items) for the gateway's simple-text replay branch. Three more fields were added to the DB later but the whitelist was never updated: - reasoning_content: provider-facing thinking text. _copy_reasoning_content_for_api promotes 'reasoning' -> 'reasoning_content' at send time only when the strings happen to match. Carrying the original verbatim avoids loss for providers that return them as distinct fields (DeepSeek/Kimi/ Moonshot thinking modes), and preserves the empty-string sentinel that DeepSeek V4 Pro requires for thinking-mode replay. - codex_message_items: exact assistant message items with 'phase'. OpenAI docs: 'preserve and resend phase on all assistant messages — dropping it can degrade performance.' Required for prefix cache hits. No recovery path exists — once dropped, gone. - finish_reason: informational; cheap to keep so transcripts replay identically across CLI and gateway. The CLI is unaffected because cli.py keeps the live in-memory message list across turns (cli.py:10046 'self.conversation_history = result["messages"]'). The gateway rebuilds agent_history from the SQLite transcript on every turn, so any field stripped during replay is silently lost. Refactors the inline whitelist into a module-level _build_replay_entry() helper so the contract can be unit-tested. 16 new tests pin the field set and falsy-value handling. Verified end-to-end: DB stores all 8 fields, replay now preserves all 8 (was preserving only 5 for assistant text turns).	2026-05-09 14:47:33 -07:00
Teknium	448c11f16d	fix(telegram): default notifications to 'important' (silence intermediate) Per-tool-call push notifications on Telegram are noisy enough that 'all' is the wrong default — long agent runs spam the user's notification shade with status messages they didn't ask to be pinged about. Final responses, approval prompts, and slash confirmations still notify; intermediate progress, streaming, and tool-progress messages now deliver silently via disable_notification. Users who want the legacy behavior can opt back in with: display: platforms: telegram: notifications: all or HERMES_TELEGRAM_NOTIFICATIONS=all.	2026-05-09 13:38:25 -07:00
Denis	236f3b0521	feat(gateway): add Telegram notification mode to suppress intermediate push notifications Add a configurable notifications mode for the Telegram platform adapter that controls which messages trigger push notifications. - display.platforms.telegram.notifications: "all" (default) \| "important" - HERMES_TELEGRAM_NOTIFICATIONS env var override - In "important" mode, all sends use disable_notification=True except: - Approvals (send_exec_approval) and slash confirmations - Final response messages (metadata["notify"]=True) - Zero overhead in default "all" mode - Zero impact on non-Telegram platforms Closes #22771	2026-05-09 13:38:25 -07:00
Wesley Simplicio	a671d8a27a	fix(email): use real hermes version in IMAP ID command	2026-05-09 13:35:50 -07:00
Wesley Simplicio	3fd4ccbd8b	fix(email): send IMAP ID extension to support 163/NetEase mailbox 163/NetEase IMAP servers reject every UID SEARCH/FETCH with `BYE Unsafe Login` unless the client first identifies itself via the RFC 2971 ID command after LOGIN. Without this, the email gateway logs in OK but then fails on the very first poll and the connection is torn down. Send the ID payload best-effort after both `imap.login()` sites (`EmailAdapter.connect` and `_fetch_new_messages`). Failures are swallowed at debug level so non-supporting IMAP servers (Gmail, Outlook, Fastmail, Yahoo, etc.) keep working unchanged. Closes #22271	2026-05-09 13:35:50 -07:00
Teknium	ea2d66ddc0	perf(gateway): defer QQAdapter and YuanbaoAdapter imports via PEP 562 (#22790 ) `gateway/platforms/__init__.py` eagerly imported `QQAdapter` and `YuanbaoAdapter` at package-init time, which transitively pulled in qqbot's chunked-upload + keyboards + onboard machinery and yuanbao's websocket stack. About 84 ms wall and 23 MB RSS on every fresh process that touched anything under `gateway.platforms` — including `hermes chat` (via run_agent → cli's plugin discovery transitive import). Nothing in the codebase actually consumes these symbols from the package root; every real call site uses the long-form path (`from gateway.platforms.qqbot import QQAdapter`, `from gateway.platforms.yuanbao import YuanbaoAdapter` in gateway/run.py). The eager re-export was only there for convenience. Replace with a PEP 562 module-level `__getattr__` that lazily imports on first attribute access. Public API stays identical: `from gateway.platforms import QQAdapter` keeps working but only pays the import cost when the symbol is actually touched. `__dir__` preserves help() / autocomplete behavior. Measured impact (7-run medians, 9950X3D): import gateway.platforms 127 → 43 ms (-66%) 50 → 27 MB (-46%) import gateway.platforms.base 127 → 44 ms (-65%) 50 → 27 MB (-46%) import cli (full chat path) 745 → 710 ms ( -5%) 96 → 90 MB ( -6%) hermes chat -q (cold) -5 MB The per-import win is biggest because qqbot/yuanbao deps don't overlap with anything on the gateway-platforms path — full `import cli` already loads aiohttp/websockets transitively from other places, so the marginal CLI win is smaller than the isolated import benchmark. The `gateway.platforms.base` win is what matters most for long-lived gateway processes: every gateway boot saves 23 MB resident. All 144 qqbot tests pass; broader gateway suite (5132 tests) passes modulo 4 pre-existing flakes also failing on main without this change.	2026-05-09 13:17:48 -07:00
Teknium	2124ad72a2	fix(api-server): emit length/error finish_reason for truncation/failure (#22775 ) Non-streaming /v1/chat/completions wrapped any AIAgent result \u2014 including partial/failed runs \u2014 as a successful 200 with finish_reason='stop' and the internal failure string substituted into message.content. API clients had no way to distinguish 'agent answered: X' from 'agent crashed and the X you see is its error message'. After the fix: - completed: True \u2192 200 finish_reason='stop' (unchanged) - partial + truncated text \u2192 200 finish_reason='length' + hermes extras - partial + no text / failed \u2192 502 OpenAI error envelope (SDKs raise) - other failures \u2192 200 finish_reason='error' + hermes extras Adds X-Hermes-Completed / X-Hermes-Partial / X-Hermes-Error headers plus a 'hermes' extras object on partial responses for clients that want the full picture. Closes #22496.	2026-05-09 12:48:08 -07:00
kshitijk4poor	dae94fa652	fix: follow-up for salvaged PR #22263 - Restore allowed_chats gate before thread_id check so ignored_threads applies universally (even to guest mentions). - Compute _message_mentions_bot once in _should_process_message to eliminate redundant second entity scan when guest_mode=true and the message does not mention the bot. - Remove redundant _is_group_chat from _is_guest_mention (caller already verified the message is a group chat). - Update _telegram_allowed_chats docstring to note guest_mode exception. - Add test coverage: bot_command entity, text_mention entity, caption_entities, and ignored_threads + guest_mode interaction. - Add nik1t7n to AUTHOR_MAP.	2026-05-09 11:54:04 -07:00
Nikita Nosov	55f518e521	feat(gateway): add Telegram guest mention mode	2026-05-09 11:54:04 -07:00
Teknium	684fd14db0	fix(dingtalk): align override signatures with base + guard Optional[error] in tests	2026-05-09 11:11:10 -07:00
qWaitCrypto	c705c7ac9b	fix(dingtalk): clarify webhook media behavior	2026-05-09 11:11:10 -07:00
briandevans	854c2ce309	fix(telegram): honor message.quote for partial-quote reply context When a Telegram user replies using the native quote feature to select only part of a prior message, _build_message_event was injecting the ENTIRE replied-to message into reply_to_text via message.reply_to_message.text/caption. python-telegram-bot exposes the user-selected substring as message.quote (TextQuote.text); we now prefer that and fall back to the full replied-to text only when no native quote is present. The agent-visible "[Replying to: \"...\"]" prefix can otherwise expand the user's narrow quote into the full prior message, causing the agent to act on unrelated actionable-looking text the user did not select (e.g. multi-item briefings where the user quotes one bullet but the prefix injects every bullet). Falls back cleanly when message.quote is absent (PTB <21 or replies that don't quote a substring). Fixes #22619 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-09 11:10:36 -07:00
qWaitCrypto	124fbb0af0	fix(gateway): refresh runtime argv metadata	2026-05-09 11:08:23 -07:00
Teknium	b9c001116e	feat: confirm prompt for destructive slash commands (#4069 ) (#22687 ) /clear, /new, /reset, and /undo now ask the user to confirm before discarding conversation state — three-option prompt routed through the existing tools.slash_confirm primitive. Native yes/no buttons render on Telegram, Discord, and Slack (their adapters already implement send_slash_confirm); other platforms get a text-fallback prompt and reply with /approve, /always, or /cancel. The classic prompt_toolkit CLI uses the same three-option flow via the established _prompt_text_input pattern (see _confirm_and_reload_mcp). TUI keeps its existing modal overlay (#12312). Gated by new config key approvals.destructive_slash_confirm (default true). Picking 'Always Approve' flips the gate to false so subsequent destructive commands run silently — matches the established mcp_reload_confirm UX. Out of scope: /cron remove (separate domain — scheduled jobs, not session history). Existing TUI overlay env-var (HERMES_TUI_NO_CONFIRM) left unchanged; cosmetic unification can come later. Closes #4069.	2026-05-09 11:04:46 -07:00
ethernet	0cafe7d50d	Merge pull request #22510 from novax635/fix/gateway-slash-confirm-boundary-cleanup fix gateway: clear slash confirm state during session boundary cleanup	2026-05-09 12:48:49 -04:00
uzunkuyruk	8fdaf4d3d6	fix(telegram): exclude row-label column from bullet items in table rendering When a GFM table has a row-label column (first column with no header), _render_table_block_for_telegram incorrectly included the row-label cell in the bullet zip alongside the data cells, producing a spurious bullet like '• 維度: 核心賣點' before the real data rows. Detect the row-label column by comparing the first data row cell count against the header count (has_row_label_col = len(first_data_row) == len(headers) + 1). When present, use cells[0] as the heading and zip headers against cells[1:] only, correctly excluding the row-label from the bullet list. Fixes #22604	2026-05-09 17:39:16 +03:00
Nikita Nosov	1ac8deb3ca	feat(gateway): stream Telegram edits safely	2026-05-09 04:34:55 -07:00
novax635	8b6501786c	fix(gateway): clear slash-confirm state during session boundary cleanup	2026-05-09 14:18:20 +03:00
GodsBoy	93e25ceb13	feat(plugins): add standalone_sender_fn for out-of-process cron delivery Plugin platforms (IRC, Teams, Google Chat) currently fail with `No live adapter for platform '<name>'` when a `deliver=<plugin>` cron job runs in a separate process from the gateway, even though the platforms are eligible cron targets via `cron_deliver_env_var` (added in #21306). Built-in platforms (Telegram, Discord, Slack, etc.) use direct REST helpers in `tools/send_message_tool.py` so cron can deliver without holding the gateway in the same process; plugin platforms historically depended on `_gateway_runner_ref()` which returns `None` out of process. This change adds an optional `standalone_sender_fn` field to `PlatformEntry` so plugins can register an ephemeral send path that opens its own connection, sends, and closes without needing the live adapter. The dispatch site in `_send_via_adapter` falls through to the hook when the gateway runner is unavailable, with a descriptive error when neither path applies. The hook is optional, so existing plugins are unaffected. Reference migrations land in the same change for IRC, Teams, and Google Chat, exercising the hook across stdlib (asyncio + IRC protocol), Bot Framework OAuth client_credentials, and Google service-account flows respectively. Security hardening on the new code paths: * IRC: control-character stripping on chat_id and message body to block CRLF command injection; bounded nick-collision retries; JOIN before PRIVMSG so channels with the default `+n` mode accept the delivery. * Teams: TEAMS_SERVICE_URL validated against an allowlist of known Bot Framework hosts (`smba.trafficmanager.net`, `smba.infra.gov.teams.microsoft.us`) to block SSRF; chat_id and tenant_id constrained to the documented Bot Framework character set; per-request timeouts so a slow STS endpoint cannot starve the activity POST. * Google Chat: chat_id and thread_id validated against strict resource-name regexes; service-account refresh wrapped in `asyncio.wait_for` so a hung token endpoint cannot stall the scheduler. Test coverage: 20 new tests covering happy path, missing-config errors, network failure modes, and each defensive validation. Existing tests unchanged. `bash scripts/run_tests.sh tests/tools/test_send_message_tool.py tests/gateway/test_irc_adapter.py tests/gateway/test_teams.py tests/gateway/test_google_chat.py` reports 341 passed, 0 regressions. Documentation: new "Out-of-process cron delivery" section in website/docs/developer-guide/adding-platform-adapters.md and an entry in gateway/platforms/ADDING_A_PLATFORM.md naming the hook.	2026-05-09 02:56:29 -07:00
heathley	7e578f02c8	feat(feishu): add native update prompt cards	2026-05-09 02:32:55 -07:00
kshitij	2a7047c2ed	fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (#22043 ) SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl byte-range locks that don't reliably work on network filesystems. Upstream documents this explicitly: https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises 'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before this change, every feature backed by state.db or kanban.db broke silently: - /resume, /title, /history, /branch returned 'Session database not available.' with no cause - gateway logged the init failure at DEBUG (invisible in errors.log) - kanban dispatcher crashed every 60s, driving the known migration race (duplicate column name: consecutive_failures, #21708 / #21374) Changes: - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL and falls back to DELETE on SQLITE_PROTOCOL-style errors with one WARNING explaining why - hermes_state.get_last_init_error() + format_session_db_unavailable(): capture the init failure cause and surface it in user-facing strings (with an NFS/SMB pointer for 'locking protocol') - hermes_cli/kanban_db.connect(): use the shared helper - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING (matches cli.py's existing correct behavior) - cli.py (4 sites) + gateway/run.py (5 sites): replace bare 'Session database not available.' with format_session_db_unavailable() Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new test in tests/hermes_cli/test_kanban_db.py. Existing suites (state, kanban, gateway, cli) remain green for all tests unrelated to pre-existing failures on main. Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home, local_lock=none) reporting 'Session database not available.' on /resume; 'locking protocol' appears in 4 distinct log entries across backup, kanban, TUI, and CLI paths in the same session. closes #22032	2026-05-09 02:09:35 -07:00
kshitijk4poor	aef297a45e	fix(telegram): skip send_chat_action for DM topic reply-fallback lanes The send path uses Hermes' reply-anchor fallback for DM topic lanes (message_thread_id + reply_to_message_id), but send_chat_action only accepts message_thread_id — Telegram's Bot API 10.0 rejects it for these lanes. Without this short-circuit, every typing tick (~every 2s during agent runs) makes a doomed API call that gets logged as a 'thread not found' debug warning. Skip the call entirely when the metadata indicates a DM topic reply-fallback lane; the user-visible behavior is unchanged (no typing indicator either way for these lanes), but the logs stay clean. Identified during salvage review of #22053.	2026-05-09 01:39:37 -07:00
Jhin Lee	b3239572f0	fix(telegram): preserve DM topic routing via reply fallback	2026-05-09 01:39:37 -07:00
LeonSGP43	dccf1fb6e0	fix(gateway): cap adapter disconnect during stop	2026-05-08 18:50:25 -07:00
Teknium	26bac67ef9	fix(entry-points): guard hermes_bootstrap import so partial updates don't brick hermes (#22091 ) teknium1 hit ModuleNotFoundError: No module named 'hermes_bootstrap' after a code update, on both his Windows machine AND his Linux workstation. The failure mode is real and affects every user who updates hermes by any path OTHER than a fully-successful ``hermes update``. ## What happens hermes_bootstrap.py is a top-level module registered via pyproject.toml's ``py-modules`` list (added by Brooklyn's Windows UTF-8 stdio work). It must be registered in the venv's editable-install .pth file before Python can find it as a bare ``import hermes_bootstrap``. ``hermes update`` handles this correctly: (1) git reset --hard, (2) clear __pycache__, (3) uv pip install -e . (re-registers the package including the new py-modules list), (4) restart. BUT if any step AFTER (1) fails — network blip during pip install, PEP 668 on a system Python, venv locked, uv not in PATH, a crash mid-update — the user is left with new code that references hermes_bootstrap and a venv that doesn't know about it. Every hermes invocation after that crashes with ModuleNotFoundError, including ``hermes update`` itself. No recovery path without manual `uv pip install -e .`. Also affects users who ``git pull`` the repo directly without running hermes update — relatively common for developers. ## Fix Wrap ``import hermes_bootstrap`` in a try/except ModuleNotFoundError across all 6 entry points (hermes_cli/main, run_agent, gateway/run, acp_adapter/entry, cli, batch_runner). On Windows, missing bootstrap means the UTF-8 stdio setup doesn't run — degraded behavior (Unicode chars may fail to print) but NOT a crash. POSIX is unaffected either way since the bootstrap is a no-op there. Once hermes is running again, the user can ``hermes update`` to fully recover. ## Test update tests/test_hermes_bootstrap.py::test_entry_point_imports_bootstrap scans for the first top-level import in each entry point and asserts it is hermes_bootstrap. Extended the check to accept a Try block whose body is a lone Import of hermes_bootstrap — that's the recovery-friendly form we just introduced. Verified behavior by ``mv hermes_bootstrap.py hermes_bootstrap.py.bak`` and confirming ``python -c "import hermes_cli.main"`` succeeds. 82/82 tests pass (hermes_bootstrap + windows-native + windows-compat).	2026-05-08 14:43:13 -07:00
Teknium	cc38282b04	feat(cross-platform): psutil for PID/process management + Windows footgun checker ## Why Hermes supports Linux, macOS, and native Windows, but the codebase grew up POSIX-first and has accumulated patterns that silently break (or worse, silently kill!) on Windows: - `os.kill(pid, 0)` as a liveness probe — on Windows this maps to CTRL_C_EVENT and broadcasts Ctrl+C to the target's entire console process group (bpo-14484, open since 2012). - `os.killpg` — doesn't exist on Windows at all (AttributeError). - `os.setsid` / `os.getuid` / `os.geteuid` — same. - `signal.SIGKILL` / `signal.SIGHUP` / `signal.SIGUSR1` — module-attr errors at runtime on Windows. - `open(path)` / `open(path, "r")` without explicit encoding= — inherits the platform default, which is cp1252/mbcs on Windows (UTF-8 on POSIX), causing mojibake round-tripping between hosts. - `wmic` — removed from Windows 10 21H1+. This commit does three things: 1. Makes `psutil` a core dependency and migrates critical callsites to it. 2. Adds a grep-based CI gate (`scripts/check-windows-footguns.py`) that blocks new instances of any of the above patterns. 3. Fixes every existing instance in the codebase so the baseline is clean. ## What changed ### 1. psutil as a core dependency (pyproject.toml) Added `psutil>=5.9.0,<8` to core deps. psutil is the canonical cross-platform answer for "is this PID alive" and "kill this process tree" — its `pid_exists()` uses `OpenProcess + GetExitCodeProcess` on Windows (NOT a signal call), and its `Process.children(recursive=True)` + `.kill()` combo replaces `os.killpg()` portably. ### 2. `gateway/status.py::_pid_exists` Rewrote to call `psutil.pid_exists()` first, falling back to the hand-rolled ctypes `OpenProcess + WaitForSingleObject` dance on Windows (and `os.kill(pid, 0)` on POSIX) only if psutil is somehow missing — e.g. during the scaffold phase of a fresh install before pip finishes. ### 3. `os.killpg` migration to psutil (7 callsites, 5 files) - `tools/code_execution_tool.py` - `tools/process_registry.py` - `tools/tts_tool.py` - `tools/environments/local.py` (3 sites kept as-is, suppressed with `# windows-footgun: ok` — the pgid semantics psutil can't replicate, and the calls are already Windows-guarded at the outer branch) - `gateway/platforms/whatsapp.py` ### 4. `scripts/check-windows-footguns.py` (NEW, 500 lines) Grep-based checker with 11 rules covering every Windows cross-platform footgun we've hit so far: 1. `os.kill(pid, 0)` — the silent killer 2. `os.setsid` without guard 3. `os.killpg` (recommends psutil) 4. `os.getuid` / `os.geteuid` / `os.getgid` 5. `os.fork` 6. `signal.SIGKILL` 7. `signal.SIGHUP/SIGUSR1/SIGUSR2/SIGALRM/SIGCHLD/SIGPIPE/SIGQUIT` 8. `subprocess` shebang script invocation 9. `wmic` without `shutil.which` guard 10. Hardcoded `~/Desktop` (OneDrive trap) 11. `asyncio.add_signal_handler` without try/except 12. `open()` without `encoding=` on text mode Features: - Triple-quoted-docstring aware (won't flag prose inside docstrings) - Trailing-comment aware (won't flag mentions in `# os.kill(pid, 0)` comments) - Guard-hint aware (skips lines with `hasattr(os, ...)`, `shutil.which(...)`, `if platform.system() != 'Windows'`, etc.) - Inline suppression with `# windows-footgun: ok — <reason>` - `--list` to print all rules with fixes - `--all` / `--diff <ref>` / staged-files (default) modes - Scans 380 files in under 2 seconds ### 5. CI integration A GitHub Actions workflow that runs the checker on every PR and push is staged at `/tmp/hermes-stash/windows-footguns.yml` — not included in this commit because the GH token on the push machine lacks `workflow` scope. A maintainer with `workflow` permissions should add it as `.github/workflows/windows-footguns.yml` in a follow-up. Content: ```yaml name: Windows footgun check on: push: branches: [main] pull_request: branches: [main] jobs: check: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: actions/setup-python@v5 with: {python-version: "3.11"} - run: python scripts/check-windows-footguns.py --all ``` ### 6. CONTRIBUTING.md — "Cross-Platform Compatibility" expansion Expanded from 5 to 16 rules, each with message, example, and fix. Recommends psutil as the preferred API for PID / process-tree operations. ### 7. Baseline cleanup (91 → 0 findings) - 14 `open()` sites → added `encoding='utf-8'` (internal logs/caches) or `encoding='utf-8-sig'` (user-editable files that Notepad may BOM) - 23 POSIX-only callsites in systemd helpers, pty_bridge, and plugin tool subprocess management → annotated with `# windows-footgun: ok — <reason>` - 7 `os.killpg` sites → migrated to psutil (see §3 above) ## Verification ``` $ python scripts/check-windows-footguns.py --all ✓ No Windows footguns found (380 file(s) scanned). $ python -c "from gateway.status import _pid_exists; import os > print('self:', _pid_exists(os.getpid())); print('bogus:', _pid_exists(999999))" self: True bogus: False ``` Proof-of-repro that `os.kill(pid, 0)` was actually killing processes before this fix — see commit ``1cbe39914`` and bpo-14484. This commit removes the last hand-rolled ctypes path from the hot liveness-check path and defers to the best-maintained cross-platform answer.	2026-05-08 14:27:40 -07:00
Teknium	324567c936	fix(windows): os.kill(pid, 0) is NOT a no-op on Windows — route through new _pid_exists helper On Windows, Python's ``os.kill(pid, 0)`` is NOT a no-op. CPython's implementation (``Modules/posixmodule.c::os_kill_impl``) treats sig=0 as ``CTRL_C_EVENT`` because the two integer values collide at the C layer, and routes it through ``GenerateConsoleCtrlEvent(0, pid)`` — which sends a Ctrl+C to the ENTIRE console process group containing the target PID, not just the PID itself. Any caller that wanted to check "is PID X alive" via the classic POSIX ``os.kill(pid, 0)`` idiom was silently killing that process (and often unrelated processes in the same console group) on Windows. Long-standing Python Windows quirk; see bpo-14484 (open since 2012). This manifested in Hermes as: every ``hermes gateway status`` invocation would read the gateway's PID from the PID file, call ``os.kill(pid, 0)`` via ``gateway.status.get_running_pid()`` as a "liveness check", and instantly terminate the gateway it was trying to report on. No shutdown log, no traceback, no atexit hook fire, no exit-diag entry — just silent termination of the detached pythonw process. "Bot answered one message then stopped typing" was the characteristic end-user symptom because `os.kill(pid, 0)` fires mid-response-send and kills the gateway between logs. Reproduction (verified in this branch before the fix): $ hermes gateway start # gateway alive, PID 37520 $ hermes gateway status # reports "No gateway process detected" $ tasklist /FI "PID eq 37520" # INFO: No tasks are running # — gateway terminated silently Root-cause fix is a new ``gateway.status._pid_exists(pid)`` helper: - On Windows: Win32 ``OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION \| SYNCHRONIZE, False, pid)`` + ``WaitForSingleObject(handle, 0)`` via ctypes. Zero signal delivery, zero console-group side effects. Pins ctypes return types to avoid DWORD-vs-signed-int parse bugs on WAIT_TIMEOUT (0x102). Distinguishes ERROR_INVALID_PARAMETER (PID gone) from ERROR_ACCESS_DENIED (alive but another user). - On POSIX: the canonical ``os.kill(pid, 0)`` idiom that actually is a no-op there. Then patch every ``os.kill(pid, 0)`` liveness-check callsite to route through ``_pid_exists`` instead. Total 14 callsites across 11 files; every single one was a latent silent-kill on Windows: gateway/run.py:2810 — /restart watcher (inline subprocess) gateway/run.py:15195 — --replace wait loop gateway/status.py:572 — acquire_gateway_runtime_lock stale check gateway/status.py:828 — get_running_pid (THE killer for status) gateway/platforms/whatsapp.py:111 hermes_cli/gateway.py:228, 522, 1012 — gateway-related drain loops hermes_cli/kanban_db.py:2826 — _pid_alive was claiming to be cross-platform but used os.kill(pid, 0) on Windows hermes_cli/main.py:5792 — CLI process-kill polling hermes_cli/profiles.py:782 — profile stop wait loop plugins/google_meet/process_manager.py:74 tools/browser_tool.py:1215, 1255 — browser daemon ownership probes tools/mcp_tool.py:1255, 3374 — MCP stdio orphan tracking The watcher source in gateway/run.py:2810 is a multi-line string that gets spawned as an inline ``python -c "..."`` subprocess, so it can't import gateway.status. The fix for that callsite inlines the same ctypes probe directly into the watcher source. Tested on Windows 10 with the hermes gateway + Telegram bot: - gateway start → alive - 5 consecutive ``hermes gateway status`` invocations → gateway alive after every one, same PID reported each time (37520, 21952) - gateway.log shows uninterrupted operation; no spurious shutdown entries; cron ticker and kanban dispatcher still running on their 60-second cadence - bot continues answering Telegram messages throughout Ships alongside an exit-path diagnostic wrapper in ``hermes_cli/gateway.py::run_gateway()`` that captures every way ``asyncio.run(start_gateway(...))`` can return (success, SystemExit, KeyboardInterrupt, BaseException, atexit) with full traceback to ``logs/gateway-exit-diag.log``. This was used to prove the gateway was being hard-killed externally (no exit event fired) and should be kept for future Windows debugging. Refs: https://bugs.python.org/issue14484 See also: references/windows-subprocess-sigint-storm.md in the hermes-agent skill.	2026-05-08 14:27:40 -07:00
Teknium	cbce5e93fc	codebase: add encoding='utf-8' to all bare open() calls (PLW1514) Closes the last Python-on-Windows UTF-8 exposure by making every text-mode open() call explicit about its encoding. Before: on Windows, bare open(path, 'r') defaults to the system locale encoding (cp1252 on US-locale installs). That means reading any config/yaml/markdown/json file with non-ASCII content either crashes with UnicodeDecodeError or silently mis-decodes bytes. After: all 89 affected call sites in production code now pass encoding='utf-8' explicitly. Works identically on every platform and every locale, no surprise behavior. Mechanical sweep via: ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' . All 89 fixes have the same shape: open(x) or open(x, mode) became open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing else changed. Every modified file still parses and the Windows/sandbox test suite is still green (85 passed, 14 skipped, 0 failed across tests/tools/test_code_execution_windows_env.py + tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py + tests/test_hermes_bootstrap.py). Scope notes: - tests/ excluded: test fixtures can use locale encoding intentionally (exercising edge cases). If we want to tighten tests later that's a separate PR. - plugins/ excluded: plugin-specific conventions may differ; plugin authors own their code. - optional-skills/ and skills/ excluded: skill scripts are user-authored and we don't want to mass-edit them. - website/ and tinker-atropos/ excluded: vendored / generated content. 46 files touched, 89 +/- lines (symmetric replacement). No behavior change on POSIX or on Windows when the file is ASCII; bug fix on Windows when the file contains non-ASCII.	2026-05-08 14:27:40 -07:00
Teknium	d94fb47717	hermes_bootstrap: Windows-only UTF-8 stdio shim for all entry points Codebase-wide fix for Python-on-Windows UTF-8 footguns, complementing the earlier execute_code sandbox fixes (which remain load-bearing for when the sandbox explicitly scrubs child env). Problem: Python on Windows has two long-standing text-encoding pitfalls: 1. sys.stdout/stderr are bound to the console code page (cp1252 on US-locale installs) — print('café') crashes with UnicodeEncodeError. 2. Subprocess children don't know to use UTF-8 unless PYTHONUTF8 and/or PYTHONIOENCODING are set in their env — so any Python we spawn (linters, sandbox children, delegation workers) hits the same bug. Solution: A tiny bootstrap module (hermes_bootstrap.py) imported as the first statement of every Hermes entry point: - hermes_cli/main.py (hermes / hermes-agent console_script) - run_agent.py (hermes-agent direct) - acp_adapter/entry.py (hermes-acp) - gateway/run.py (messaging gateway) - batch_runner.py (parallel batch mode) - cli.py (legacy direct-launch CLI) On Windows, the bootstrap: - os.environ.setdefault('PYTHONUTF8', '1') (PEP 540 UTF-8 mode) - os.environ.setdefault('PYTHONIOENCODING', 'utf-8') - sys.stdout/stderr/stdin.reconfigure(encoding='utf-8', errors='replace') Children inherit the env vars → they run in UTF-8 mode. Current process's stdio is reconfigured → print('café') works now. On POSIX (Linux/macOS), the bootstrap is a complete no-op. We don't touch LANG, LC_, or anything else — users who have intentionally configured a non-UTF-8 locale aren't affected. POSIX systems are already UTF-8 by default in 99% of modern setups, so there's nothing to fix. setdefault() (not overwrite) means users who explicitly set PYTHONUTF8=0 or PYTHONIOENCODING=cp1252 in their environment are respected. What this does NOT fix: bare open(path, 'w') calls in the parent* process still default to locale encoding because PYTHONUTF8 is only read at interpreter init. A ruff PLW1514 sweep (separate follow-up) will add explicit encoding='utf-8' at those ~219 call sites for belt-and-suspenders. Tests (17): 16 passed, 1 skipped on Windows. - Windows: env vars set, stdio reconfigured, child inherits UTF-8 mode - POSIX: complete no-op (verified on fake POSIX + skipped on real POSIX since we don't have a Linux box in this session) - Idempotence: multiple calls safe - Graceful degradation: non-reconfigurable streams don't crash - User opt-out: explicit PYTHONUTF8=0 is respected - Load order: every entry point's FIRST top-level import is hermes_bootstrap, enforced by an AST-level parametrized test pyproject.toml: added hermes_bootstrap to py-modules so it ships with pip installs.	2026-05-08 14:27:40 -07:00
Teknium	e93bfc6c93	feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags Second pass on native Windows support, driven by a systematic audit across five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG, os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns (npm.cmd batch shims, start_new_session no-op on Windows), subsystem health (cron, gateway daemon, update flow), and module-level import guards. Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux, test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix. ## New module - hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command, windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs). All no-ops on non-Windows. ## CRITICAL fixes (would crash or silently break on Windows) - tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would AttributeError on import on Windows, breaking `hermes --tui` entirely (it spawns this module as a subprocess). Guard each signal.signal() call with hasattr() and add SIGBREAK as Windows' SIGHUP equivalent. - hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was unguarded. os.WNOHANG doesn't exist on Windows. Gate the whole reap loop behind `os.name != "nt"` — Windows has no zombies anyway. - tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on most Windows builds. Fall back to loopback TCP (AF_INET on 127.0.0.1:0 ephemeral port) when _IS_WINDOWS. HERMES_RPC_SOCKET env var now accepts either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows). Generated sandbox client parses both. - cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded. Use shutil.which("bash") so Windows (Git Bash via MinGit) works, with a readable error when bash is genuinely absent. - 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py (npm install + node version probe), browser_tool.py x2. On Windows npm is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...]) fails with WinError 193. shutil.which(...) returns the absolute .cmd path which CreateProcessW accepts because the extension routes through cmd.exe /c. POSIX behaviour unchanged (shutil.which still returns the same path subprocess would resolve itself). ## HIGH fixes (silent misbehaviour on Windows) - tools/environments/local.py get_temp_dir: hardcoded /tmp returned on Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote via MSYS2's virtual /tmp but native Python couldn't open. Result: cwd tracking silently broken — `cd` in terminal tool did nothing. Windows branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes (works in both bash and Python, guaranteed no spaces). - tools/environments/local.py _make_run_env PATH injection: `/usr/bin not in split(":")` heuristic mangles Windows PATH (";" separator). Gate the injection behind `not _IS_WINDOWS`. - hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer Popen + watcher-script Popen both used start_new_session=True, which Windows silently ignores. Watcher stayed attached to CLI's console, died when user closed terminal after `hermes update`, left gateway stale. Now branches through windows_detach_popen_kwargs() helper (CREATE_NEW_PROCESS_GROUP \| DETACHED_PROCESS \| CREATE_NO_WINDOW on Windows, start_new_session=True on POSIX — identical to main). ## MEDIUM fixes - gateway/run.py /restart and /update handlers: hardcoded bash/setsid chain crashes on Windows when user triggers /update in-gateway. Now has sys.platform=="win32" branch using sys.executable + a tiny Python watcher with proper detach flags. POSIX path is unchanged. - cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/... style paths that break subprocess.Popen(cwd=...) and Path().resolve(). Added _normalize_git_bash_path() helper that translates /c/Users, /cygdrive/c, /mnt/c variants to native C:\Users form. POSIX no-op. _git_repo_root() now routes every result through it. - cli.py worktree .worktreeinclude: os.symlink on directories failed hard on Windows (requires admin or Developer Mode). Falls back to shutil.copytree with a warning log. ## Tests - 29 new tests in tests/tools/test_windows_native_support.py covering: subprocess_compat helpers, TUI entry signal guards, kanban waitpid guard, code_execution TCP fallback source-level invariants, cron bash resolution, npm/npx bare-spawn lint per-file, local env Windows temp dir, PATH injection gating, git bash path normalization, symlink fallback, gateway detached watcher flags. - One existing test assertion adjusted in test_browser_homebrew_paths: it compared captured Popen argv to the BARE `"npx"` literal; after the shutil.which() change argv[0] is the absolute path. New assertion checks the shape (two items, second is `agent-browser`) rather than the exact first-item string. Behaviour unchanged; test was too strict. All 56 tests pass on Linux (30 from previous commits + 26 new). 267 tests from the affected files/dirs (browser, code_exec, local_env, process_registry, kanban_db, windows_compat) all pass — zero regressions. tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged; all pre-existing test failures confirmed unrelated via `git stash` re-run. ## What's still deferred (LOW priority) - Visible cmd-window flashes on short-lived console apps (~14 sites) — cosmetic, needs a follow-up pass once we have user reports. - agent/file_safety.py POSIX-only security deny patterns — separate hardening task. - tools/process_registry.py returning "/tmp" as fallback — theoretical; reachable only when all env-var candidates fail.	2026-05-08 14:27:40 -07:00
Teknium	9de893e3b0	feat(windows): close native-Windows install gaps — crash-free startup, UTF-8 stdio, tzdata dep, docs Native Windows (with Git for Windows installed) can now run the Hermes CLI and gateway end-to-end without crashing. install.ps1 already existed and the Git Bash terminal backend was already wired up — this PR fills the remaining gaps discovered by auditing every Windows-unsafe primitive (`signal.SIGKILL`, `os.kill(pid, 0)` probes, bare `fcntl`/`termios` imports) and by comparing hermes against how Claude Code, OpenCode, Codex, and Cline handle native Windows. ## What changed ### UTF-8 stdio (new module) - `hermes_cli/stdio.py` — single `configure_windows_stdio()` entry point. Flips the console code page to CP_UTF8 (65001), reconfigures `sys.stdout`/`stderr`/`stdin` to UTF-8, sets `PYTHONIOENCODING` + `PYTHONUTF8` for subprocesses. No-op on non-Windows. Opt out via `HERMES_DISABLE_WINDOWS_UTF8=1`. - Called early in `cli.py::main`, `hermes_cli/main.py::main`, and `gateway/run.py::main` so Unicode banners (box-drawing, geometric symbols, non-Latin chat text) don't `UnicodeEncodeError` on cp1252 consoles. ### Crash sites fixed - `hermes_cli/main.py:7970` (hermes update → stuck gateway sweep): raw `os.kill(pid, _signal.SIGKILL)` → `gateway.status.terminate_pid(pid, force=True)` which routes through `taskkill /T /F` on Windows. - `hermes_cli/profiles.py::_stop_gateway_process`: same fix — also converted SIGTERM path to `terminate_pid()` and widened OSError catch on the intermediate `os.kill(pid, 0)` probe. - `hermes_cli/kanban_db.py:2914, 3041`: raw `signal.SIGKILL` → `getattr(signal, "SIGKILL", signal.SIGTERM)` fallback (matches the pattern already used in `gateway/status.py`). ### OSError widening on `os.kill(pid, 0)` probes Windows raises `OSError` (WinError 87) for a gone PID instead of `ProcessLookupError`. Widened the catch at: - `gateway/run.py:15101` (`--replace` wait-for-exit loop — without this, the loop busy-spins the full 10s every Windows gateway start) - `hermes_cli/gateway.py:228, 460, 940` - `hermes_cli/profiles.py:777` - `tools/process_registry.py::_is_host_pid_alive` - `tools/browser_tool.py:1170, 1206` ### Dashboard PTY graceful degradation `hermes_cli/pty_bridge.py` depends on `fcntl`/`termios`/`ptyprocess`, none of which exist on native Windows. Previously a Windows dashboard would crash on `import hermes_cli.web_server` because of a top-level import. Now: - `hermes_cli/web_server.py` wraps the pty_bridge import in `try/except ImportError` and sets `_PTY_BRIDGE_AVAILABLE=False`. - The `/api/pty` WebSocket handler returns a friendly "use WSL2 for this tab" message instead of exploding. - Every other dashboard feature (sessions, jobs, metrics, config editor) runs natively on Windows. ### Dependency - `pyproject.toml`: add `tzdata>=2023.3; sys_platform == 'win32'` so Python's `zoneinfo` works on Windows (which has no IANA tzdata shipped with the OS). Credits @sprmn24 (PR #13182). ### Docs - README.md: removed "Native Windows is not supported"; added PowerShell one-liner and Git-for-Windows prerequisite note. - `website/docs/getting-started/installation.md`: new Windows section with capability matrix (everything native except the dashboard `/chat` PTY tab, which is WSL2-only). - `website/docs/user-guide/windows-wsl-quickstart.md`: reframed as "WSL2 as an alternative to native" rather than "the only way". - `website/docs/developer-guide/contributing.md`: updated cross-platform guidance with the `signal.SIGKILL` / `OSError` rules we enforce now. - `website/docs/user-guide/features/web-dashboard.md`: acknowledged native Windows works for everything except the embedded PTY pane. ## Why this shape Pulled from a survey of how other agent codebases handle native Windows (Claude Code, OpenCode, Codex, Cline): - All four treat Git Bash as the canonical shell on Windows, same as hermes already does in `tools/environments/local.py::_find_bash()`. - None of them force `SetConsoleOutputCP` — but they don't have to, Node/Rust write UTF-16 to the Win32 console API. Python does not get that for free, so we flip CP_UTF8 via ctypes. - None of them ship PowerShell-as-primary-shell (Claude Code exposes PS as a secondary tool; scope creep for this PR). - All of them use `taskkill /T /F` for force-kill on Windows, which is exactly what `gateway.status.terminate_pid(force=True)` does. ## Non-goals (deliberate scope limits) - No PowerShell-as-a-second-shell tool — worth designing separately. - No terminal routing rewrite (#12317, #15461, #19800 cluster) — that's the hardest design call and needs a separate doc. - No wholesale `open()` → `open(..., encoding="utf-8")` sweep (Tianworld cluster) — will do as follow-up if users hit actual breakage; most modern code already specifies it. ## Validation - 28 new tests in `tests/tools/test_windows_native_support.py` — all platform-mocked, pass on Linux CI. Cover: - `configure_windows_stdio` idempotency, opt-out, env-preservation - `terminate_pid` taskkill routing, failure → OSError, FileNotFoundError fallback - `getattr(signal, "SIGKILL", …)` fallback shape - `_is_host_pid_alive` OSError widening (Windows-gone-PID behavior) - Source-level checks that all entry points call `configure_windows_stdio` - pty_bridge import-guard present in `web_server.py` - README no longer says "not supported" - 12 pre-existing tests in `tests/tools/test_windows_compat.py` still pass. - `tests/hermes_cli/` ran fully (3909 passed, 9 failures — all confirmed pre-existing on main by stash-test). - `tests/gateway/` ran fully (5021 passed, 1 pre-existing failure). - `tests/tools/test_process_registry.py` + `test_browser_*` pass. - Manual smoke: `import hermes_cli.stdio; import gateway.run; import hermes_cli.web_server` — all clean, `_PTY_BRIDGE_AVAILABLE=True` on Linux (as expected). ## Files - New: `hermes_cli/stdio.py`, `tests/tools/test_windows_native_support.py` - Modified: `cli.py`, `gateway/run.py`, `hermes_cli/main.py`, `hermes_cli/profiles.py`, `hermes_cli/gateway.py`, `hermes_cli/kanban_db.py`, `hermes_cli/pty_bridge.py`, `hermes_cli/web_server.py`, `tools/browser_tool.py`, `tools/process_registry.py`, `pyproject.toml`, `README.md`, and 4 docs pages. Credits to everyone whose prior PR work informed these fixes — see the co-author trailers. All of the PRs listed in `~/.hermes/plans/windows-support-prs.md` fixing `os.kill` / `signal.SIGKILL` / UTF-8 stdio / tzdata / README patterns found the same issues; this PR consolidates them. Co-authored-by: Philip D'Souza <9472774+PhilipAD@users.noreply.github.com> Co-authored-by: Arecanon <42595053+ArecaNon@users.noreply.github.com> Co-authored-by: XiaoXiao0221 <263113677+XiaoXiao0221@users.noreply.github.com> Co-authored-by: Lars Hagen <1360677+lars-hagen@users.noreply.github.com> Co-authored-by: Luan Dias <65574834+luandiasrj@users.noreply.github.com> Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com> Co-authored-by: sprmn24 <oncuevtv@gmail.com> Co-authored-by: adybag14-cyber <252811164+adybag14-cyber@users.noreply.github.com> Co-authored-by: Prasanna28Devadiga <54196612+Prasanna28Devadiga@users.noreply.github.com>	2026-05-08 14:27:40 -07:00
Dilee	07bbd93337	feat(teams-pipeline): add plugin runtime and operator cli Third slice of the Microsoft Teams meeting pipeline stack, salvaged onto current main. Adds the standalone teams_pipeline plugin that consumes Graph change notifications from the webhook listener, resolves meeting artifacts (transcript first, recording + STT fallback later), persists job state in a durable store, and exposes an operator CLI for inspection, replay, subscription management, and validation. Design choices follow maintainer review feedback on PR #19815: - Standalone plugin rather than bolted-on core surface (plugins/teams_pipeline/, kind: standalone in plugin.yaml). - Zero new model tools. The agent drives the pipeline by invoking the operator CLI via the terminal tool, guided by the skill that ships with a follow-up PR. - Reuses the existing msgraph_webhook gateway platform for Graph ingress. Pipeline runtime is wired in via bind_gateway_runtime and gated on plugins.enabled so gateways that don't run the plugin boot cleanly. Additions: - plugins/teams_pipeline/: runtime (gateway wiring + config builder), pipeline core, durable SQLite store, subscription maintenance helpers, Graph artifact resolution, operator CLI (list, show, run/replay, fetch dry-run, subscriptions list, subscribe, renew-subscription, delete-subscription, maintain-subscriptions, token-health, validate). - hermes_cli/main.py: second-pass plugin CLI discovery so any standalone plugin registered via ctx.register_cli_command() outside the memory-plugin convention path gets its subcommand wired into argparse without touching core. - gateway/run.py: _teams_pipeline_plugin_enabled() config gate, _wire_teams_pipeline_runtime() binding after adapter setup, and the two runner attributes used by the runtime. Credit to @dlkakbs for the entire plugin implementation.	2026-05-08 11:18:14 -07:00
Teknium	b8d7e0e6d3	fix(msgraph_webhook): harden auth surface + IP allowlisting + response hygiene Defense-in-depth polish on top of the webhook listener before it becomes a real attack surface once the pipeline starts creating subscriptions and Graph starts POSTing to the configured public URL. - Timing-safe clientState comparison. Previously used `==` on strings; switches to hmac.compare_digest so a mismatch does not leak how many leading characters matched. client_state is documented as a strong shared secret (openssl rand -hex 32 in the setup docs), so a timing-safe primitive is the right call. - Split GET and POST handlers. Graph validates a subscription by sending GET with validationToken in the query; anything else on GET is now a 400 so the endpoint cannot be probed or mistakenly used for data exfil. Previously a bare GET fell through to the POST path and blew up on request.json() with a confusing 400. - Empty response bodies on success. 202 is returned with no body so internal counters (accepted / duplicates / scheduled) do not leak to any caller that can reach the endpoint; counters remain observable via /health for operators. 403 on every-item-bad-clientState batches (so forged POSTs stop retrying), 400 on malformed / unknown-resource batches (sender configuration issue). - Optional source-IP allowlist. New `allowed_source_cidrs` extra field (list or comma-separated string) and `MSGRAPH_WEBHOOK_ALLOWED_SOURCE_CIDRS` env var let operators restrict the webhook to Microsoft Graph's published webhook source ranges in production. Empty = allow all, preserving dev-tunnel / localhost workflows. Invalid CIDRs are logged and ignored rather than crashing. Also gates the handshake endpoint so disallowed IPs cannot probe it. - Tests updated for the new response contract (empty-body 202, auth-only 403, config-error 400) and extended to cover: bare GET rejection, POST-with-validationToken handshake tolerance, timing-safe compare actually invoked via hmac.compare_digest spy, malformed body / missing value array, IP allowlist accept/reject paths, handshake IP allowlist, invalid CIDR entries, comma-string CIDR list parsing. 52/52 passed (was 40). Full gateway suite: 5049 passed / 1 pre-existing failure in test_discord_free_response (unrelated, reproduces on clean origin/main).	2026-05-08 10:29:58 -07:00
Dilee	26a59e4f6c	fix(msgraph): normalize webhook dedupe and resource matching	2026-05-08 10:29:58 -07:00
Dilee	2a215de9af	fix(msgraph): bound webhook receipt dedupe cache	2026-05-08 10:29:58 -07:00
Dilee	46a6f39024	feat(msgraph): add webhook listener platform	2026-05-08 10:29:58 -07:00
Zhicheng Han	526c0e018a	feat(api-server): expose run approval events	2026-05-08 07:30:14 -07:00
JC	03ddff8897	fix(gateway): defer goal status notices until after response delivery Route goal status notices through the platform adapter send API and register post-delivery callbacks so completed-goal notices appear after the final assistant response. Also cancel queued synthetic goal continuations on /goal pause and /goal clear while preserving normal queued user messages.	2026-05-07 17:33:09 -07:00
Teknium	2564132a1f	fix(telegram): preserve thread_id=1 for forum General typing indicator (#21390 ) The May 5 refactor in `d5357f816` made _message_thread_id_for_typing() symmetric with _message_thread_id_for_send() by mapping the General topic (thread id "1") to None upfront for both. That's correct for sendMessage — Telegram rejects message_thread_id=1 on sends and the topic must be omitted — but it's wrong for sendChatAction. Observed behavior (confirmed via before/after Telegram wire traces): Before `d5357f816`: thread_id=1 → message_thread_id=1 → bubble visible in General After `d5357f816`: thread_id=1 → message_thread_id=None → no visible typing Omitting message_thread_id on sendChatAction does NOT fall back to the General topic's view in a forum-enabled supergroup; the bubble ends up hidden from the client's General-topic pane entirely. For any user on a forum-group, the typing indicator stopped appearing. Fix: drop the symmetric "1 → None" mapping from the typing resolver. sendMessage still maps 1 → None via _message_thread_id_for_send (that side was never broken). The asymmetry is real and required by Telegram's API — document it in the resolver docstring. Partial revert of `d5357f816`; restores the behavior from `0cf7d570e` ("fix(telegram): restore typing indicator and thread routing for forum General topic"). Does not re-introduce the retry-without-thread fallback that `41545f7ec` scoped down for DM topics — with the resolver fixed, the first call already hits the right wire shape. Test updated from test_send_typing_general_topic_uses_none_thread_id (which encoded the broken contract) to test_send_typing_preserves_general_topic_thread_id, asserting the single correct call with message_thread_id=1. 10 other tests in the file untouched and passing.	2026-05-07 08:39:21 -07:00
WideLee	4de3ef38b1	feat(qqbot): wire native tool-approval UX via inline keyboards Makes the in-tree QQ inline keyboards actually light up when the agent blocks on a dangerous-command approval. Matches the cross-adapter gateway contract already implemented by Discord, Telegram, Slack, Matrix, and Feishu. Gateway/run.py's _approval_notify_sync checks type(adapter).send_exec_approval and falls back to a text prompt when it's missing. Without this wiring, QQ users stared at plain '/approve' text even though the adapter shipped button primitives. ### send_exec_approval(chat_id, command, session_key, description, metadata) Matches the signature the gateway calls with. Builds an ApprovalRequest (command_preview, description, timeout) and delegates to send_approval_request. Uses the last inbound msg_id as reply_to so QQ accepts the passive message. The 'metadata' parameter is accepted for contract parity but intentionally unused — QQ doesn't have thread_id/DM-targeting overrides. ### send_update_prompt(chat_id, prompt, default, session_key, metadata) Signature updated to match the cross-adapter contract used by 'hermes update --gateway' watcher. Renders a 'Update Needs Your Input' prompt with the optional default hint and a Yes/No keyboard. Replaces the earlier 3-arg helper that wasn't wired anywhere. ### Default interaction dispatcher _default_interaction_dispatch() auto-registered as the adapter's interaction callback in __init__. Routes: - approve:<session_key>:<decision> → tools.approval.resolve_gateway_approval Button → choice mapping: allow-once → 'once' allow-always → 'always' deny → 'deny' (QQ's 3-button mobile layout deliberately collapses 'session' + 'always' into one button; /approve session text fallback remains available.) - update_prompt:<answer> → atomic write of y/n to ~/.hermes/.update_response (the detached 'hermes update --gateway' watcher polls this file) - anything else → logged and dropped Resolve exceptions are caught and logged — never propagate into the WS loop. Callers can override via set_interaction_callback() to route clicks elsewhere or pass None to drop them entirely. ### Net effect QQ users now get native tap-to-approve UX on dangerous-command prompts and update-confirmation prompts, without having to type /approve or /deny as text. The adapter hooks into tools.approval the same way every other button-capable platform does. ### Tests 14 new tests cover: - Default callback installed on __init__ - send_exec_approval / send_update_prompt exist as class methods (so the gateway's type-probe detects them) - allow-once/always/deny each map to the correct resolve choice - update_prompt:y / update_prompt:n each write atomically to the response file (via monkeypatched get_hermes_home) - Unknown button_data / empty button_data / resolve exceptions are harmless - send_exec_approval honours last_msg_id reply-to and accepts metadata - send_update_prompt delegates with correct content + keyboard Full qqbot suite: 144 passed (72 pre-existing + 72 from this salvage arc). Also ran tools/test_approval.py alongside — no regressions (276 passed combined). Co-authored-by: WideLee <limkuan24@gmail.com>	2026-05-07 07:48:15 -07:00
teknium1	898b6d7d55	fix(webhook): widen INSECURE_NO_AUTH loopback check + tests + docs Follow-up to the previous commit: - Add _is_loopback_host() helper covering 127.0.0.1, localhost, ::1, ip6-localhost, ip6-loopback (case-insensitive). Empty/None host is treated as non-loopback since unset usually means public default bind. - Fix mixed-indent comment in the safety rail (comment now aligned with the if-block) and collapse the nested-if into one condition. - Add TestInsecureNoAuthSafetyRail covering rejection on 0.0.0.0, a LAN IP, and empty host; allowance on 127.0.0.1/localhost; plus unit-level parametrized coverage of _is_loopback_host for spellings we can't bind in the hermetic test env (::1, ip6-localhost, ip6-loopback). - Pin test_connect_starts_server + test_webhook_deliver_only defaults to 127.0.0.1 so they keep passing under the new rail. - Document the behavior in website/docs/user-guide/messaging/webhooks.md.	2026-05-07 07:38:43 -07:00
0z!	fb4f953569	fix: block INSECURE_NO_AUTH on non-localhost webhook bindings	2026-05-07 07:38:43 -07:00
Teknium	5c08b851df	docs(platforms): document env_enablement_fn + cron_deliver_env_var hooks (#21331 ) Following PR #21306 which added the new generic plugin-platform hooks, update the three platform-authoring docs so plugin authors find them: - website/docs/developer-guide/adding-platform-adapters.md: expand the 'What the Plugin System Handles Automatically' table with env-only auto-enable + cron delivery + hermes-config UI entries rows. Add three new sections — 'Env-Driven Auto-Configuration', 'Cron Delivery', 'Surfacing Env Vars in hermes config' — covering the hook signatures, plugin.yaml rich-dict format, and the home_channel-key special case. Update the main register() example to pass env_enablement_fn + cron_deliver_env_var inline so readers see them on their first pass. Upgrade the PLUGIN.yaml snippet to show bare-string + rich-dict + optional_env. - website/docs/guides/build-a-hermes-plugin.md: the thin platform example in the build-a-plugin tour now includes env_enablement_fn and cron_deliver_env_var, plus an optional_env block in the inline plugin.yaml. Keeps pointing to the developer-guide page for the full treatment. - gateway/platforms/ADDING_A_PLATFORM.md: the in-repo reference shallow-points at the docsite but now names the three new hooks explicitly so contributors reading the source tree know what they're for. Also adds teams + google_chat as reference implementations alongside irc.	2026-05-07 07:36:42 -07:00
WideLee	5b121c6e35	feat(qqbot): process attachments in quoted (reply) messages When a user replies while quoting another message, QQ sets 'message_type = 103' and pushes the referenced message's content + attachments inside 'msg_elements[0]'. The old adapter ignored msg_elements entirely, so: - Bare quote-replies (no user text) surfaced nothing to the LLM. - Quoted images/files/voice were never downloaded or described. - Quoted voice messages specifically produced no transcript — the model had no way to see what the user was referring to when saying 'about this voice note…'. This commit adds _process_quoted_context(d) which extracts msg_elements, unions their attachments, and runs them through the SAME _process_attachments pipeline as the main message body. Quoted voice gets an STT transcript (tried via QQ's asr_refer_text first, then the configured STT provider); quoted images get cached just like main-body images; quoted files surface with their original filename intact (not the CDN URL hash). The quoted content is prepended to the user's text as a '[Quoted message]:' block so the LLM sees the full referential context on one turn. Images-only quotes surface a '[Quoted message]: (image)' marker so the model knows an image was referenced even if no text came with it. All four inbound handlers (_handle_c2c_message, _handle_group_message, _handle_guild_message, _handle_dm_message) now call the helper uniformly — one merge pattern, not four divergent implementations. Filename preservation is carried by _process_attachments' existing '[Attachment: {filename or ct}]' line; nothing else needed for that. 12 new tests under TestProcessQuotedContext and TestMergeQuoteInto cover: - Non-quote messages short-circuit to empty - message_type=103 with no msg_elements is harmless - Text-only quotes render with '[Quoted message]:' prefix - Voice attachments in the quote flow through STT - File attachments in the quote preserve the original filename - Image attachments surface cached paths + media types - Images-only quote still emits a marker - Multiple msg_elements are concatenated - Malformed message_type values return empty - _merge_quote_into prepends with a blank-line separator Full qqbot suite: 130 passed (72 existing + 19 chunked + 27 keyboards + 12 quoted). Co-authored-by: WideLee <limkuan24@gmail.com>	2026-05-07 07:36:30 -07:00
WideLee	de584cd1dd	feat(qqbot): add inline-keyboard approvals and update prompts The QQ Bot v2 API supports inline keyboards on outbound messages. When a user taps a button, the platform dispatches an INTERACTION_CREATE gateway event; the bot ACKs it via PUT /interactions/{id} and decodes the button's data payload to route the click. This commit adds: New module gateway/platforms/qqbot/keyboards.py - Inline-keyboard dataclasses (InlineKeyboard, KeyboardRow, KeyboardButton, KeyboardButtonAction, KeyboardButtonRenderData, KeyboardButtonPermission) that serialize to the JSON shape the QQ API expects. - build_approval_keyboard(session_key) — 3-button layout: ✅ 允许一次 / ⭐ 始终允许 / ❌ 拒绝, all sharing group_id='approval' so clicking one greys out the rest. - build_update_prompt_keyboard() — Yes/No keyboard for update confirms. - parse_approval_button_data() / parse_update_prompt_button_data() — decode the button_data payload from INTERACTION_CREATE. approve:<session_key>:<decision> (decision = allow-once\|allow-always\|deny) update_prompt:<answer> (answer = y\|n) - build_approval_text(ApprovalRequest) — markdown renderer for the surrounding message body (exec-approval and plugin-approval variants, with severity icons 🔴/🔵/🟡). - parse_interaction_event(raw) → InteractionEvent dataclass — normalizes the nested raw payload (id / scene / openids / button_data / etc.). Adapter changes (gateway/platforms/qqbot/adapter.py) - _dispatch_payload routes INTERACTION_CREATE → _on_interaction. - _on_interaction parses the event, ACKs via PUT /interactions/{id}, then invokes a user-registered interaction callback. Exceptions from the callback are caught and logged (never propagate into the WS loop). - set_interaction_callback(cb) lets gateway wiring register a routing handler that inspects button_data and resolves the corresponding pending approval / update prompt. - _send_c2c_text / _send_group_text now accept an optional keyboard kwarg and append it to the outbound body. - send_with_keyboard(chat_id, content, keyboard, reply_to=None) — public helper that sends a single short message with a keyboard attached. Does NOT chunk-split (a keyboard message has one interactive surface). Guild chats are rejected non-retryably — they don't support keyboards. - send_approval_request(chat_id, ApprovalRequest, reply_to=None) + send_update_prompt(chat_id, content, reply_to=None) — convenience wrappers over send_with_keyboard. Tests 27 new unit tests under TestApprovalButtonData, TestUpdatePromptButtonData, TestBuildApprovalKeyboard, TestBuildUpdatePromptKeyboard, TestBuildApprovalText, TestInteractionEventParsing, and TestAdapterInteractionDispatch. Cover: - Button-data round-trip (build → parse returns original session/decision) - Keyboard JSON shape + mutual-exclusion group_id - Exec vs plugin approval text templates + severity icons - Interaction event parsing (c2c / group / guild scene codes) - _on_interaction end-to-end: ACK invoked, callback receives parsed event, callback exceptions are swallowed, missing id skips ACK, no registered callback is harmless. Full qqbot suite: 118 passed (72 existing + 19 chunked + 27 keyboards). Co-authored-by: WideLee <limkuan24@gmail.com>	2026-05-07 07:36:30 -07:00
WideLee	9feaeb632b	feat(qqbot): add chunked upload with structured error types The v2 'single POST /v2/{users\|groups}/{id}/files' upload path is capped at ~10 MB inline (base64 'file_data' or 'url'). For larger files the QQ platform provides a three-step flow: 1. POST /upload_prepare → upload_id + pre-signed COS part URLs 2. PUT each part to its COS URL → POST /upload_part_finish 3. POST /files with {upload_id} → file_info token This commit adds a new gateway/platforms/qqbot/chunked_upload.py module that implements the flow, wires it into QQAdapter._send_media for local files (URL uploads keep the existing inline path), and introduces structured exceptions so the caller can surface actionable error text: - UploadDailyLimitExceededError (biz_code 40093002, non-retryable) - UploadFileTooLargeError (file exceeds the platform limit) Both carry file_name / file_size_human / limit_human so the model can compose user-friendly replies instead of seeing opaque HTTP codes. The part_finish 40093001 retryable-error loop respects the server- provided retry_timeout (capped at 10 minutes locally) with a 1 s polling interval. COS PUTs retry transient failures up to 2 times with exponential backoff. complete_upload retries up to 2 times. Covers files up to the platform's ~100 MB per-file limit; before this the adapter silently rejected anything over ~10 MB. 19 new unit tests under TestChunkedUpload* cover the happy path, prepare-response parsing, helper functions, part retries, COS PUT retries, group vs c2c routing, and the structured-error mapping. Co-authored-by: WideLee <limkuan24@gmail.com>	2026-05-07 07:36:30 -07:00

1 2 3 4 5 ...

1534 commits