hermes-agent

mirror of https://github.com/NousResearch/hermes-agent.git synced 2026-05-14 04:02:26 +00:00

Author	SHA1	Message	Date
Teknium	11a89cc032	docs: backfill coverage for recently-merged features (#11942)
Fills documentation gaps that accumulated as features merged ahead of their docs updates. All additions are verified against code and the originating PRs.
Providers:
- Ollama Cloud (#10782): new provider section, env vars, quickstart/fallback rows
- xAI Grok Responses API + TTS (#10783): provider note, TTS table + config
- Google Gemini CLI OAuth (#11270): quickstart/fallback/cli-commands entries
- NVIDIA NIM (#11774): NVIDIA_API_KEY / NVIDIA_BASE_URL in env-vars reference
- HERMES_INFERENCE_PROVIDER enum updated
Messaging:
- DISCORD_ALLOWED_ROLES (#11608): env-vars, discord.md access control section
- DingTalk QR device-flow (#11574): wizard path in Option A + openClaw disclosure
- Feishu document comment intelligent reply (#11898): full section + 3-tier access control + CLI
Skills / commands:
- concept-diagrams skill (#11363): optional-skills-catalog entry
- /gquota (#11270): slash-commands reference
Build: docusaurus build passes, ascii-guard lint 0 errors.	2026-04-17 21:22:11 -07:00
Teknium	45acd9beb5	fix(gateway): ignore redelivered /restart after PTB offset ACK fails (#11940)
When a Telegram /restart fires and PTB's graceful-shutdown `get_updates` ACK call times out ("When polling for updates is restarted, updates may be received twice" in gateway.log), the new gateway receives the same /restart again and restarts a second time, creating a self-perpetuating loop.
Record the triggering update_id in `.restart_last_processed.json` when handling /restart. On the next process, reject a /restart whose update_id <= the recorded one as a stale redelivery. A 5-minute staleness guard ensures an orphaned marker can't block a legitimately new /restart.
- gateway/platforms/base.py: add `platform_update_id` to MessageEvent
- gateway/platforms/telegram.py: propagate `update.update_id` through _build_message_event for text/command/location/media handlers
- gateway/run.py: write dedup marker in _handle_restart_command; _is_stale_restart_redelivery checks it before processing /restart
- tests/gateway/test_restart_redelivery_dedup.py: 9 new tests covering fresh restart, redelivery, staleness window, cross-platform, malformed-marker resilience, and no-update_id (CLI) bypass
Only active for Telegram today (the one platform with monotonic cross-session update ordering); other platforms return False from _is_stale_restart_redelivery and proceed normally.	2026-04-17 21:17:33 -07:00
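The dedup check described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the function names `record_restart` and `is_stale_redelivery`, the marker's JSON field names, and the explicit `now` parameter are all assumptions made for the example.

```python
import json
import time
from pathlib import Path
from typing import Optional

# The 5-minute staleness guard mentioned in the commit message.
STALENESS_WINDOW_S = 5 * 60


def record_restart(marker: Path, update_id: int, now: Optional[float] = None) -> None:
    """Persist the update_id that triggered /restart (hypothetical marker layout)."""
    now = time.time() if now is None else now
    marker.write_text(json.dumps({"update_id": update_id, "ts": now}))


def is_stale_redelivery(marker: Path, update_id: Optional[int],
                        now: Optional[float] = None) -> bool:
    """Return True only when a fresh marker proves this /restart was already handled."""
    if update_id is None:
        # No update_id (for example a CLI-originated restart): never treated as stale.
        return False
    now = time.time() if now is None else now
    try:
        data = json.loads(marker.read_text())
        recorded = int(data["update_id"])
        written_at = float(data["ts"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or malformed marker: process the command normally.
        return False
    if now - written_at > STALENESS_WINDOW_S:
        # Orphaned marker: must not block a legitimately new /restart.
        return False
    return update_id <= recorded
```

Passing `now` explicitly keeps the staleness window testable without sleeping; a real handler would presumably rely on the wall clock.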
Teknium	c5c0bb9a73	fix: point optional-dep install hints at the venv's python (#11938) Error messages that tell users to install optional extras now use {sys.executable} -m pip install ... instead of a bare 'pip install hermes-agent[extra]' string. Under the curl installer, bare 'pip' resolves to system pip, which either fails with PEP 668 externally-managed-environment or installs into the wrong Python. Affects: hermes dashboard, hermes web server startup, mcp_serve, hermes doctor Bedrock check, CLI voice mode, voice_mode tool runtime error, Discord voice-channel join failure message.	2026-04-17 21:16:33 -07:00
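A small sketch of the hint format this commit describes, assuming a hypothetical helper named `install_hint` (the repository may build these strings inline instead). The `dingtalk` extra is used only because it is named elsewhere in this log.

```python
import sys


def install_hint(extra: str, package: str = "hermes-agent") -> str:
    """Build an install command bound to the interpreter that is running right now.

    Using sys.executable avoids whatever `pip` happens to be first on PATH,
    which may belong to a different (or externally managed) Python.
    """
    return f'{sys.executable} -m pip install "{package}[{extra}]"'


if __name__ == "__main__":
    print("Missing optional dependency. Install it with:")
    print("  " + install_hint("dingtalk"))
```

Quoting the requirement keeps shells such as zsh from interpreting the square brackets as a glob pattern.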
Teknium	20f2258f34	fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace (#11907 ) * fix(interrupt): propagate to concurrent-tool workers + opt-in debug trace interrupt() previously only flagged the agent's _execution_thread_id. Tools running inside _execute_tool_calls_concurrent execute on ThreadPoolExecutor worker threads whose tids are distinct from the agent's, so is_interrupted() inside those tools returned False no matter how many times the gateway called .interrupt() — hung ssh / curl / long make-builds ran to their own timeout. Changes: - run_agent.py: track concurrent-tool worker tids in a per-agent set, fan interrupt()/clear_interrupt() out to them, and handle the register-after-interrupt race at _run_tool entry. getattr fallback for the tracker so test stubs built via object.__new__ keep working. - tools/environments/base.py: opt-in _wait_for_process trace (ENTER, per-30s HEARTBEAT with interrupt+activity-cb state, INTERRUPT DETECTED, TIMEOUT, EXIT) behind HERMES_DEBUG_INTERRUPT=1. - tools/interrupt.py: opt-in set_interrupt() trace (caller tid, target tid, set snapshot) behind the same env flag. - tests: new regression test runs a polling tool on a concurrent worker and asserts is_interrupted() flips to True within ~1s of interrupt(). Second new test guards clear_interrupt() clearing tracked worker bits. Validation: tests/run_agent/ all 762 pass; tests/tools/ interrupt+env subset 216 pass. * fix(interrupt-debug): bypass quiet_mode logger filter so trace reaches agent.log AIAgent.__init__ sets logging.getLogger('tools').setLevel(ERROR) when quiet_mode=True (the CLI default). This would silently swallow every INFO-level trace line from the HERMES_DEBUG_INTERRUPT=1 instrumentation added in the parent commit — confirmed by running hermes chat -q with the flag and finding zero trace lines in agent.log even though _wait_for_process was clearly executing (subprocess pid existed). 
Fix: when HERMES_DEBUG_INTERRUPT=1, each traced module explicitly sets its own logger level to INFO at import time, overriding the 'tools' parent-level filter. Scoped to the opt-in case only, so production (quiet_mode default) logs stay quiet as designed. Validation: hermes chat -q with HERMES_DEBUG_INTERRUPT=1 now writes '_wait_for_process ENTER/EXIT' lines to agent.log as expected. * fix(cli): SIGTERM/SIGHUP no longer orphans tool subprocesses Tool subprocesses spawned by the local environment backend use os.setsid so they run in their own process group. Before this fix, SIGTERM/SIGHUP to the hermes CLI killed the main thread via KeyboardInterrupt but the worker thread running _wait_for_process never got a chance to call _kill_process — Python exited, the child was reparented to init (PPID=1), and the subprocess ran to its natural end (confirmed live: sleep 300 survived 4+ min after SIGTERM to the agent until manual cleanup). Changes: - cli.py _signal_handler (interactive) + _signal_handler_q (-q mode): route SIGTERM/SIGHUP through agent.interrupt() so the worker's poll loop sees the per-thread interrupt flag and calls _kill_process (os.killpg) on the subprocess group. HERMES_SIGTERM_GRACE (default 1.5s) gives the worker time to complete its SIGTERM+SIGKILL escalation before KeyboardInterrupt unwinds main. - tools/environments/base.py _wait_for_process: wrap the poll loop in try/except (KeyboardInterrupt, SystemExit) so the cleanup fires even on paths the signal handlers don't cover (direct sys.exit, unhandled KI from nested code, etc.). Emits EXCEPTION_EXIT trace line when HERMES_DEBUG_INTERRUPT=1. - New regression test: injects KeyboardInterrupt into a running _wait_for_process via PyThreadState_SetAsyncExc, verifies the subprocess process group is dead within 3s of the exception and that KeyboardInterrupt re-raises cleanly afterward. 
Validation:
| Before | After |
|---|---|
| sleep 300 survives 4+ min as PPID=1 orphan after SIGTERM | dies within 2 s |
| No INTERRUPT DETECTED in trace | INTERRUPT DETECTED fires + killing process group |
| tests/tools/test_local_interrupt_cleanup | 1/1 pass |
| tests/run_agent/test_concurrent_interrupt | 4/4 pass |	2026-04-17 20:39:25 -07:00
Teknium	607be54a24	fix(discord): forum channel media + polish Extend forum support from PR #10145: - REST path (_send_discord): forum thread creation now uploads media files as multipart attachments on the starter message in a single call. Previously media files were silently dropped on the forum path. - Websocket media paths (_send_file_attachment, send_voice, send_image, send_animation — covers send_image_file, send_video, send_document transitively): forum channels now go through a new _forum_post_file helper that creates a thread with the file as starter content, instead of failing via channel.send(file=...) which forums reject. - _send_to_forum chunk follow-up failures are collected into raw_response['warnings'] so partial-send outcomes surface. - Process-local probe cache (_DISCORD_CHANNEL_TYPE_PROBE_CACHE) avoids GET /channels/{id} on every uncached send after the first. - Dedup of TestSendDiscordMedia that the PR merge-resolution left behind. - Docs: Forum Channels section under website/docs/user-guide/messaging/discord.md. Tests: 117 passed (22 new for forum+media, probe cache, warnings).	2026-04-17 20:25:48 -07:00
ChimingLiu	e5333e793c	feat(discord): support forum channels	2026-04-17 20:25:48 -07:00
helix4u	148459716c	fix(kimi): cover remaining fixed-temperature bypasses	2026-04-17 20:25:42 -07:00
Teknium	53e4a2f2c6	feat(update): warn about legacy hermes.service units during hermes update (#11918) Follow-up to #11909: surface the legacy-unit warning where users are most likely to see it. After a 'hermes update', if a pre-rename hermes.service is still installed alongside the current hermes-gateway.service, print the list of legacy units + the 'hermes gateway migrate-legacy' command. Profile-safe: reuses _find_legacy_hermes_units() which is an explicit allowlist of hermes.service only, so profile units never match. Platform-gated: only prints on systemd hosts (the rename is Linux-only). Non-blocking: just prints, never prompts, so gateway-spawned hermes update --gateway runs aren't affected.	2026-04-17 19:35:12 -07:00
Teknium	07db20c72d	fix(gateway): detect legacy hermes.service + mark --replace SIGTERM as planned (#11909 ) * fix(gateway): detect legacy hermes.service units from pre-rename installs Older Hermes installs used a different service name (hermes.service) before the rename to hermes-gateway.service. When both units remain installed, they fight over the same bot token — after PR #5646's signal-recovery change, this manifests as a 30-second SIGTERM flap loop between the two services. Detection is an explicit allowlist (no globbing) plus an ExecStart content check, so profile units (hermes-gateway-<profile>.service) and unrelated third-party services named 'hermes' are never matched. Wired into systemd_install, systemd_status, gateway_setup wizard, and the main hermes setup flow — anywhere we already warn about scope conflicts now also warns about legacy units. * feat(gateway): add migrate-legacy command + install-time removal prompt - New hermes_cli.gateway.remove_legacy_hermes_units() removes legacy unit files with stop → disable → unlink → daemon-reload. Handles user and system scopes separately; system scope returns path list when not running as root so the caller can tell the user to re-run with sudo. - New 'hermes gateway migrate-legacy' subcommand (with --dry-run and -y) routes to remove_legacy_hermes_units via gateway_command dispatch. - systemd_install now offers to remove legacy units BEFORE installing the new hermes-gateway.service, preventing the SIGTERM flap loop that hits users who still have pre-rename hermes.service around. Profile units (hermes-gateway-<profile>.service) remain untouched in all paths — the legacy allowlist is explicit (_LEGACY_SERVICE_NAMES) and the ExecStart content check further narrows matches. * fix(gateway): mark --replace SIGTERM as planned so target exits 0 PR #5646 made SIGTERM exit the gateway with code 1 so systemd's Restart=on-failure revives it after unexpected kills. 
But when a user has two gateway units fighting for the same bot token (e.g. legacy hermes.service + hermes-gateway.service from a pre-rename install), the --replace takeover itself becomes the 'unexpected' SIGTERM — the loser exits 1, systemd revives it 30s later, and the cycle flaps indefinitely. Before calling terminate_pid(), --replace now writes a short-lived marker file naming the target PID + start_time. The target's shutdown_signal_handler consumes the marker and, when it names this process, leaves _signal_initiated_shutdown=False so the final exit code stays 0. Staleness defences: - PID + start_time combo prevents PID reuse matching an old marker - Marker older than 60s is treated as stale and discarded - Marker is unlinked on first read even if it doesn't match this process - Replacer clears the marker post-loop + on permission-denied give-up	2026-04-17 19:27:58 -07:00
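The planned-takeover marker and its staleness defences could look roughly like the sketch below. It is an illustration under stated assumptions: the function names, the JSON field names, and passing the process start time in as a plain number are all hypothetical, and how the gateway actually obtains a process start time is not shown in this log.

```python
import json
import time
from pathlib import Path
from typing import Optional

MARKER_TTL_S = 60.0  # markers older than 60s are treated as stale


def write_planned_takeover(marker: Path, target_pid: int, target_start_time: float,
                           now: Optional[float] = None) -> None:
    """Replacer side: name the process that is about to receive a planned SIGTERM."""
    now = time.time() if now is None else now
    marker.write_text(json.dumps(
        {"pid": target_pid, "start_time": target_start_time, "written_at": now}
    ))


def consume_planned_takeover(marker: Path, my_pid: int, my_start_time: float,
                             now: Optional[float] = None) -> bool:
    """Target side: True when the incoming SIGTERM was announced for this process."""
    now = time.time() if now is None else now
    try:
        raw = marker.read_text()
    except OSError:
        return False
    try:
        marker.unlink()  # unlinked on first read, even if it turns out not to match
    except OSError:
        pass
    try:
        data = json.loads(raw)
        pid = int(data["pid"])
        start_time = float(data["start_time"])
        written_at = float(data["written_at"])
    except (ValueError, KeyError, TypeError):
        return False
    if now - written_at > MARKER_TTL_S:
        return False
    # PID plus start_time guards against PID reuse matching an old marker.
    return pid == my_pid and start_time == my_start_time
```

A signal handler built on this would exit 0 when `consume_planned_takeover` returns True and keep the non-zero exit otherwise, so `Restart=on-failure` only revives unexpected kills.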
Teknium	38436eb4e3	chore(release): add pedh to AUTHOR_MAP	2026-04-17 19:26:53 -07:00
pedh	86fd0f846d	docs(dingtalk): document AI Cards, emoji reactions, and display settings - AI Cards: how to configure ``card_template_id`` for streaming rich replies - Emoji reactions: 🤔Thinking → 🥳Done lifecycle - Per-platform display settings (streaming, tool_progress, reasoning, etc.) - Installation: switch to the ``hermes-agent[dingtalk]`` extra (adds alibabacloud-dingtalk alongside dingtalk-stream) - Messaging capability matrix updated to reflect images, audio, video, and threading support	2026-04-17 19:26:53 -07:00
pedh	4459913f40	feat(dingtalk): AI Cards streaming, emoji reactions, and media handling
Cherry-picked from #10985 by pedh, adapted to current main:
* Keeps main's full group-chat gating (require_mention + allowed_users + free_response_chats + mention_patterns); PR's simpler subset dropped.
* Keeps main's fire-and-forget process() dispatch + session_webhook fallback for SDK >= 0.24.
* Picks up PR's REQUIRES_EDIT_FINALIZE capability flag on BasePlatformAdapter + finalize kwarg on edit_message(), plumbed through stream_consumer. Default False so Telegram/Slack/Discord/Matrix stay on the zero-overhead fast path.
* DingTalk AI Card lifecycle: per-chat _message_contexts, two-card flow (tool-progress + final response) with sibling auto-close driven by reply_to, idempotent 🤔Thinking → 🥳Done swap, ``alibabacloud-dingtalk`` for media URL resolution (replaces raw HTTP that was 403-ing).
* pyproject: dingtalk extra now dingtalk-stream>=0.20,<1 + alibabacloud-dingtalk>=2.0.0 + qrcode.
Closes #10991
Co-authored-by: pedh	2026-04-17 19:26:53 -07:00
Teknium	d7ef562a05	fix(file-ops): follow terminal env's live cwd in _exec instead of init-time cached cwd (#11912)
ShellFileOperations captured the terminal env's cwd at __init__ time and used that stale value for every subsequent _exec() call. When the user ran `cd` via the terminal tool, `env.cwd` updated but `ops.cwd` did not. Relative paths passed to patch_replace / read_file / write_file / search then targeted the ORIGINAL directory instead of the current one.
Observed symptom in agent sessions:
  terminal: cd .worktrees/my-branch
  patch hermes_cli/main.py <old> <new>
  → returns {"success": true} with a plausible unified diff
  → but `git diff` in the worktree shows nothing
  → the patch landed in the main repo's checkout of main.py instead
The diff looked legitimate because patch_replace computes it from the IN-MEMORY content vs new_content, not by re-reading the file. The write itself DID succeed; it just wrote to the wrong directory's copy of the same-named file.
Fix: _exec() now resolves cwd from live sources in this order:
1. Explicit `cwd` arg (if provided by the caller)
2. Live `self.env.cwd` (tracks `cd` commands run via terminal)
3. Init-time `self.cwd` (fallback when env has no cwd attribute)
Includes a 5-test regression suite covering:
- cd followed by relative read follows live cwd
- the exact reported bug: patch_replace with relative path after cd
- explicit cwd= arg still wins over env.cwd
- env without cwd attribute falls back to init-time cwd
- patch_replace success reflects real file state (safety rail)
Co-authored-by: teknium1 <teknium@nousresearch.com>	2026-04-17 19:26:40 -07:00
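The three-step resolution order from this commit can be illustrated with a standalone function. This is a sketch only: `resolve_cwd` is a hypothetical name, and the real logic lives inside `ShellFileOperations._exec()`, whose exact signature is not shown in this log.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_cwd(explicit_cwd: Optional[str], env: object, init_cwd: str) -> str:
    """Pick the working directory using the precedence listed in the commit."""
    if explicit_cwd:
        return explicit_cwd              # 1. explicit argument wins
    live = getattr(env, "cwd", None)
    if live:
        return live                      # 2. live env cwd, which tracks `cd`
    return init_cwd                      # 3. init-time fallback


if __name__ == "__main__":
    env = SimpleNamespace(cwd="/repo")
    env.cwd = "/repo/.worktrees/my-branch"   # simulates `cd` run through the terminal tool
    print(resolve_cwd(None, env, "/repo"))
```

Reading `env.cwd` on every call, rather than caching it at construction time, is the property that prevents the stale-directory bug described above.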
helix4u	47010e0757	fix(gateway): allow systemd-backed distrobox services	2026-04-17 19:24:30 -07:00
Teknium	213e39463b	chore(release): add akhater to AUTHOR_MAP Contributor of PR #11858 (nous OAuth providers mirror fix). CI blocks releases on unmapped author emails.	2026-04-17 19:13:40 -07:00
Teknium	2297c5f5ce	fix(auth): restore --label for hermes auth add nous --type oauth persist_nous_credentials() now accepts an optional label kwarg which gets embedded in providers.nous under the 'label' key. _seed_from_singletons() prefers the embedded label over the auto-derived label_from_token() fingerprint when materialising the pool entry, so re-seeding on every load_pool('nous') preserves the user's chosen label. auth_commands.py threads --label through to the helper, restoring parity with how other OAuth providers (anthropic, codex, google, qwen) honor the flag. Tests: 4 new (embed, reseed-survives, no-label fallback, end-to-end through auth_add_command). All 390 nous/auth/credential_pool tests pass.	2026-04-17 19:13:40 -07:00
Antoine Khater	c7fece1f9d	fix: normalise Nous device-code pool source to avoid duplicates Review feedback on the original commit: the helper wrote a pool entry with source `manual:device_code` while `_seed_from_singletons()` upserts with `device_code` (no `manual:` prefix), so the pool grew a duplicate row on every `load_pool()` after login. Normalise: the helper now writes `providers.nous` and delegates the pool write entirely to `_seed_from_singletons()` via a follow-up `load_pool()` call. The canonical source is `device_code`; the helper never materialises a parallel `manual:device_code` entry. - `persist_nous_credentials()` loses its `label` and `source` kwargs — both are now derived by the seed path from the singleton state. - CLI and web dashboard call sites simplified accordingly. - New test `test_persist_nous_credentials_idempotent_no_duplicate_pool_entries` asserts that two consecutive persists leave exactly one pool row and no stray `manual:` entries. - Existing `test_auth_add_nous_oauth_persists_pool_entry` updated to assert the canonical source and single-entry invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 19:13:40 -07:00
Antoine Khater	c096a6935f	fix(auth): mirror Nous OAuth credentials to providers.nous on CLI login `hermes auth add nous --type oauth` only wrote credential_pool.nous, leaving providers.nous empty. When the Nous agent_key's 24h TTL expired, run_agent.py's 401-recovery path called resolve_nous_runtime_credentials (which reads providers.nous), got AuthError "Hermes is not logged into Nous Portal", caught it as logger.debug (suppressed at INFO level), and the agent died with "Non-retryable client error" — no signal to the user that recovery even tried. Introduce persist_nous_credentials() as the single source of truth for Nous device-code login persistence. Both auth_commands (CLI) and web_server (dashboard) now route through it, so pool and providers stay in sync at write time. Why: CLI-provisioned profiles couldn't recover from agent_key expiry, producing silent daily outages 24h after first login. PR #6856/#6869 addressed adjacent issues but assumed providers.nous was populated; this one wasn't being written. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 19:13:40 -07:00
Teknium	a155b4a159	feat(auxiliary): default 'auto' routing to main model for all users (#11900 ) Before: aggregator users (OpenRouter / Nous Portal) running 'auto' routing for auxiliary tasks — compression, vision, web extraction, session search, etc. — got routed to a cheap provider-side default model (Gemini Flash). Non-aggregator users already got their main model. Behavior was inconsistent and surprising — users picked Claude / GPT / their preferred model, but side tasks ran on Gemini Flash. After: 'auto' means "use my main chat model" for every user, regardless of provider type. Only when the main provider has no working client does the fallback chain run (OpenRouter → Nous → custom → Codex → API-key providers). Explicit per-task overrides in config.yaml (auxiliary.<task>.provider / .model) still win — they are a hard constraint, not subject to the auto policy. Vision auto-detection follows the same policy: try main provider + main model first (with _PROVIDER_VISION_MODELS overrides preserved for providers like xiaomi and zai that ship a dedicated multimodal model distinct from their chat model). Aggregator strict vision backends are fallbacks, not the primary path. Changes: - agent/auxiliary_client.py: _resolve_auto() drops the `_AGGREGATOR_PROVIDERS` guard. resolve_vision_provider_client() auto branch unifies aggregator and exotic-provider paths — everyone goes through resolve_provider_client() with main_model. Dead _AGGREGATOR_PROVIDERS constant removed (was only used by the guard we just removed). - hermes_cli/main.py: aux config menu copy updated to reflect the new semantics ("'auto' means 'use my main model'"). - tests/agent/test_auxiliary_main_first.py: 12 regression tests covering OpenRouter/Nous/DeepSeek main paths, runtime-override wins, explicit-config wins, vision override preservation for exotic providers, and fallback-chain activation when the main provider has no working client. 
Co-authored-by: teknium1 <teknium@nousresearch.com>	2026-04-17 19:13:23 -07:00
Teknium	b449a0e049	fix(feishu-comment): use get_hermes_home(); drop dead asyncio wrapper; AUTHOR_MAP Follow-up polish on top of the cherry-picked #11023 commit. - feishu_comment_rules.py: replace import-time "~/.hermes" expanduser fallback with get_hermes_home() from hermes_constants (canonical, profile-safe). - tools/feishu_doc_tool.py, tools/feishu_drive_tool.py: drop the asyncio.get_event_loop().run_until_complete(asyncio.to_thread(...)) dance. Tool handlers run synchronously in a worker thread with no running loop, so the RuntimeError branch was always the one that executed. Calls client.request directly now. Unused asyncio import removed. - tests/gateway/test_feishu.py: add register_p2_customized_event to the mock EventDispatcher builder so the existing adapter test matches the new handler registration for drive.notice.comment_add_v1. - scripts/release.py: map liujinkun@bytedance.com -> liujinkun2025 for contributor attribution on release notes.	2026-04-17 19:04:11 -07:00
liujinkun	85cdb04bd4	feat: add Feishu document comment intelligent reply with 3-tier access control - Full comment handler: parse drive.notice.comment_add_v1 events, build timeline, run agent, deliver reply with chunking support. - 5 tools: feishu_doc_read, feishu_drive_list_comments, feishu_drive_list_comment_replies, feishu_drive_reply_comment, feishu_drive_add_comment. - 3-tier access control rules (exact doc > wildcard "*" > top-level > defaults) with per-field fallback. Config via ~/.hermes/feishu_comment_rules.json, mtime-cached hot-reload. - Self-reply filter using generalized self_open_id (supports future user-identity subscriptions). Receiver check: only process events where the bot is the @mentioned target. - Smart timeline selection, long text chunking, semantic text extraction, session sharing per document, wiki link resolution. Change-Id: I31e82fd6355173dbcc400b8934b6d9799e3137b9	2026-04-17 19:04:11 -07:00
Teknium	9b14b76eb3	fix(wecom): bound req_id cache, revert undocumented is_group change, add tests Follow-up to the cherry-picked contributor fix: - Extract `_remember_chat_req_id()` and bound it at DEDUP_MAX_SIZE like `_reply_req_ids` — the unbounded dict would grow forever on a long- running gateway with many chats. - Move the cache write to AFTER the group/DM policy check so we don't cache req_ids from blocked senders. - Revert the undocumented `is_group` change: the contributor flipped `chattype == 'group'` to `bool(chatid)`, which wasn't mentioned in the PR description and weakens the signal (chattype is the explicit hint; relying on chatid presence assumes DMs never carry it). Keep the original check. - Drop the defensive `getattr(self, '_last_chat_req_ids', {})` reads at both send sites — the attribute is initialized in __init__. - Update `test_send_uses_passive_reply_stream_...` → `_markdown_...` to match the new msgtype, and add a new TestWeComZombieSessionFix class covering device_id presence in subscribe, per-chat req_id caching + bounding, blocked-sender cache exclusion, and the group APP_CMD_RESPONSE fallback path.	2026-04-17 19:03:29 -07:00
Devorun	2992802b35	fix(wecom): resolve WebSocket zombie sessions and group chat 600039 errors #11554	2026-04-17 19:03:29 -07:00
Teknium	04a0c3cb95	fix(config): preserve env refs when save_config rewrites config (#11892) Co-authored-by: binhnt92 <84617813+binhnt92@users.noreply.github.com>	2026-04-17 19:03:26 -07:00
Teknium	8444f66890	feat(hermes model): add Configure auxiliary models UI to `hermes model` (#11891 ) Previously users had to hand-edit config.yaml to route individual auxiliary tasks (vision, compression, web_extract, etc.) to a specific provider+model. Add a first-class picker reachable from the bottom of the existing `hermes model` provider list. Flow: hermes model → Configure auxiliary models... → <task picker: 9 tasks, shows current setting inline> → <provider picker: authenticated providers + auto + custom> → <model picker: curated list + live pricing> The aux picker does NOT re-run credential/OAuth setup; users authenticate providers through the normal `hermes model` flow, then route aux tasks to them here. `list_authenticated_providers()` gates the list to providers the user has configured. Also: - 'Cancel' entry relabeled 'Leave unchanged' (sentinel still 'cancel' internally, so dispatch logic is unchanged) - 'Reset all to auto' entry to bulk-clear aux overrides; preserves user-tuned timeout / download_timeout values - Adds `title_generation` task to DEFAULT_CONFIG.auxiliary — the task was called from agent/title_generator.py but was missing from defaults, so config-backed timeout overrides never worked for it Co-authored-by: teknium1 <teknium@nousresearch.com>	2026-04-17 19:02:06 -07:00
Teknium	bb85404b16	chore: add Sara Reynolds to AUTHOR_MAP	2026-04-17 18:58:29 -07:00
Sara Reynolds	8ab1aa2efc	fix(gateway): fix discrepancies in gateway status	2026-04-17 18:58:29 -07:00
Xowiek	511ed4dacc	fix(gateway): bypass active-session guard for gateway-handled slash commands	2026-04-17 18:58:03 -07:00
Michel Belleau	d465fc5869	fix(skills): use frontmatter name in skills index instead of directory name build_skills_system_prompt() was using the skill directory name (skill_name) when appending to skills_by_category in all three code paths (snapshot cache, cold filesystem scan, external dirs). This meant any skill whose directory name differed from its frontmatter `name` field would appear under the wrong name in the system prompt, causing LLM routing failures. The snapshot entry already stores both skill_name (dir) and frontmatter_name (declared); switch the three tuple appends to use frontmatter_name. Also fix the external-dir dedup set (seen_skill_names) to track frontmatter names for consistency with the local-skill tuples now stored under frontmatter_name. Fixes #11777 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-17 18:56:37 -07:00
helix4u	016ae5c334	fix(kimi): force 0.6 on main chat path	2026-04-17 18:47:01 -07:00
Teknium	304fb921bf	fix: two process leaks (agent-browser daemons, paste.rs sleepers) (#11843 ) Both fixes close process leaks observed in production (18+ orphaned agent-browser node daemons, 15+ orphaned paste.rs sleep interpreters accumulated over ~3 days, ~2.7 GB RSS). ## agent-browser daemon leak Previously the orphan reaper (_reap_orphaned_browser_sessions) only ran from _start_browser_cleanup_thread, which is only invoked on the first browser tool call in a process. Hermes sessions that never used the browser never swept orphans, and the cross-process orphan detection relied on in-process _active_sessions, which doesn't see other hermes PIDs' sessions (race risk). - Write <session>.owner_pid alongside the socket dir recording the hermes PID that owns the daemon (extracted into _write_owner_pid for direct testability). - Reaper prefers owner_pid liveness over in-process _active_sessions. Cross-process safe: concurrent hermes instances won't reap each other's daemons. Legacy tracked_names fallback kept for daemons that predate owner_pid. - atexit handler (_emergency_cleanup_all_sessions) now always runs the reaper, not just when this process had active sessions — every clean hermes exit sweeps accumulated orphans. ## paste.rs auto-delete leak _schedule_auto_delete spawned a detached Python subprocess per call that slept 6 hours then issued DELETE requests. No dedup, no tracking — every 'hermes debug share' invocation added ~20 MB of resident Python interpreters that stuck around until the sleep finished. - Replaced the spawn with ~/.hermes/pastes/pending.json: records {url, expire_at} entries. - _sweep_expired_pastes() synchronously DELETEs past-due entries on every 'hermes debug' invocation (run_debug() dispatcher). - Network failures stay in pending.json for up to 24h, then give up (paste.rs's own retention handles the 'user never runs hermes again' edge case). 
- Zero subprocesses; regression test asserts subprocess/Popen/time.sleep never appear in the function source (skipping docstrings via AST).
## Validation
| | Before | After |
|---|---|---|
| Orphan agent-browser daemons | 18 accumulated | 2 (live) |
| paste.rs sleep interpreters | 15 accumulated | 0 |
| RSS reclaimed | - | ~2.7 GB |
| Targeted tests | - | 2253 pass |
E2E verified: alive-owner daemons NOT reaped; dead-owner daemons SIGTERM'd and socket dirs cleaned; pending.json sweep deletes expired entries without spawning subprocesses.	2026-04-17 18:46:30 -07:00
helix4u	64b354719f	Support browser CDP URL from config	2026-04-17 16:05:04 -07:00
brooklyn!	e9b8ece103	Merge pull request #4692 from NousResearch/feat/ink-refactor Feat/ink refactor	2026-04-17 18:02:37 -05:00
Teknium	3f43aec15d	fix(tools): bound _read_tracker sub-containers + prune _completion_consumed (#11839 ) Two accretion-over-time leaks that compound over long CLI / gateway lifetimes. Both were flagged in the memory-leak audit. ## file_tools._read_tracker _read_tracker[task_id] holds three sub-containers that grew unbounded: read_history set of (path, offset, limit) tuples — 1 per unique read dedup dict of (path, offset, limit) → mtime — same growth pattern read_timestamps dict of resolved_path → mtime — 1 per unique path A CLI session uses one stable task_id for its lifetime, so these were uncapped. A 10k-read session accumulated ~1.5MB of tracker state that the tool no longer needed (only the most recent reads are relevant for dedup, consecutive-loop detection, and write/patch external-edit warnings). Fix: _cap_read_tracker_data() enforces hard caps on each container after every add. Defaults: read_history=500, dedup=1000, read_timestamps=1000. Eviction is insertion-order (Python 3.7+ dict guarantee) for the dicts; arbitrary for the set (which only feeds diagnostic summaries). ## process_registry._completion_consumed Module-level set that recorded every session_id ever polled / waited / logged. No pruning. Each entry is ~20 bytes, so the absolute leak is small, but on a gateway processing thousands of background commands per day the set grows until process exit. Fix: _prune_if_needed() now discards _completion_consumed entries alongside the session dict evictions it already performs (both the TTL-based prune and the LRU-over-cap prune). Adds a final belt-and-suspenders pass that drops any dangling entries whose session_id no longer appears in _running or _finished. 
Tests: tests/tools/test_accretion_caps.py — 9 cases * Each container bound respected, oldest evicted * No-op when under cap (no unnecessary work) * Handles missing sub-containers without crashing * Live read_file_tool path enforces caps end-to-end * _completion_consumed pruned on TTL expiry * _completion_consumed pruned on LRU eviction * Dangling entries (no backing session) cleared Broader suite: 3486 tests/tools + tests/cli pass. The single flake (test_alias_command_passes_args) reproduces on unchanged main — known cross-test pollution under suite-order load.	2026-04-17 15:53:57 -07:00
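The capping strategy described in this commit (insertion-order eviction for dicts, arbitrary eviction for the set) can be sketched as below. The helper names `cap_dict` and `cap_set` are hypothetical stand-ins; the commit's own helper is `_cap_read_tracker_data()`, whose body is not reproduced here.

```python
from typing import Any, Dict, Set


def cap_dict(d: Dict[Any, Any], max_size: int) -> None:
    """Evict oldest-inserted keys until the dict fits within max_size.

    Relies on dicts preserving insertion order (guaranteed since Python 3.7).
    """
    while len(d) > max_size:
        del d[next(iter(d))]


def cap_set(s: Set[Any], max_size: int) -> None:
    """Evict arbitrary members until the set fits; sets carry no ordering."""
    while len(s) > max_size:
        s.pop()


if __name__ == "__main__":
    read_timestamps = {f"/path/{i}": float(i) for i in range(5)}
    cap_dict(read_timestamps, 3)
    print(list(read_timestamps))  # the three most recently inserted paths remain
```

Both helpers are no-ops when the container is already under its cap, which matches the "no unnecessary work" case listed in the test summary.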
Brooklyn Nicholson	aa583cb14e	Merge branch 'main' of github.com:NousResearch/hermes-agent into feat/ink-refactor	2026-04-17 17:51:40 -05:00
Teknium	0a83187801	refactor(kimi): use _fixed_temperature_for_model helper in flush_memories Replace the hardcoded 'kimi-for-coding' string check with the helper from auxiliary_client so there is one source of truth for the list of models with fixed-temperature contracts. Adding a new entry to _FIXED_TEMPERATURE_MODELS now automatically covers flush_memories too.	2026-04-17 15:49:14 -07:00
helix4u	2b60478fc2	fix(kimi): force kimi-for-coding temperature to 0.6	2026-04-17 15:49:14 -07:00
Teknium	c6fd2619f7	fix(gemini-cli): surface MODEL_CAPACITY_EXHAUSTED cleanly + drop retired gemma-4-26b (#11833)

Google-side 429 Code Assist errors now flow through Hermes' normal rate-limit path (status_code on the exception, Retry-After preserved via error.response) instead of being opaque RuntimeErrors. User sees a one-line capacity message instead of a 500-char JSON dump.

Changes

- CodeAssistError grows status_code / response / retry_after / details attrs. _extract_status_code in error_classifier picks up status_code and classifies 429 as FailoverReason.rate_limit, so fallback_providers triggers the same way it does for SDK errors. run_agent.py line ~10428 already walks error.response.headers for Retry-After, so preserving the response means that path just works.
- _gemini_http_error parses the Google error envelope (error.status + error.details[].reason from google.rpc.ErrorInfo, retryDelay from google.rpc.RetryInfo). MODEL_CAPACITY_EXHAUSTED / RESOURCE_EXHAUSTED / 404 model-not-found each produce a human-readable message; unknown shapes fall back to the previous raw-body format.
- Drop gemma-4-26b-it from hermes_cli/models.py, hermes_cli/setup.py, and agent/model_metadata.py. Google returned 404 for it today in local repro. Kept gemma-4-31b-it (capacity-constrained but not retired).

Validation

|                       | Before                                            | After                                                                    |
|-----------------------|---------------------------------------------------|--------------------------------------------------------------------------|
| Error message         | 'Code Assist returned HTTP 429: {500 chars JSON}' | 'Gemini capacity exhausted for gemini-2.5-pro (Google-side throttle...)' |
| status_code on error  | None (opaque RuntimeError)                        | 429                                                                      |
| Classifier reason     | unknown (string-match fallback)                   | FailoverReason.rate_limit                                                |
| Retry-After honored   | ignored                                           | extracted from RetryInfo or header                                       |
| gemma-4-26b-it picker | advertised (404s on Google)                       | removed                                                                  |

Unit + E2E tests cover non-streaming 429, streaming 429, 404 model-not-found, Retry-After header fallback, malformed body, and classifier integration. Targeted suites: tests/agent/test_gemini_cloudcode.py (81 tests), full tests/hermes_cli (2203 tests) green.

Co-authored-by: teknium1 <teknium@nousresearch.com>	2026-04-17 15:34:12 -07:00
Teknium	d2206c69cc	fix(qqbot): add back-compat for env var rename; drop qrcode core dep

Follow-up to WideLee's salvaged PR #11582.

Back-compat for QQ_HOME_CHANNEL → QQBOT_HOME_CHANNEL rename:

- gateway/config.py reads QQBOT_HOME_CHANNEL, falls back to QQ_HOME_CHANNEL with a one-shot deprecation warning so users on the old name aren't silently broken.
- cron/scheduler.py: _HOME_TARGET_ENV_VARS['qqbot'] now maps to the new name; _get_home_target_chat_id falls back to the legacy name via a _LEGACY_HOME_TARGET_ENV_VARS table.
- hermes_cli/status.py + hermes_cli/setup.py: honor both names when displaying or checking for missing home channels.
- hermes_cli/config.py: keep legacy QQ_HOME_CHANNEL[_NAME] in _EXTRA_ENV_KEYS so .env sanitization still recognizes them.

Scope cleanup:

- Drop qrcode from core dependencies and requirements.txt (remains in messaging/dingtalk/feishu extras). _qqbot_render_qr already degrades gracefully when qrcode is missing, printing a 'pip install qrcode' tip and falling back to URL-only display.
- Restore @staticmethod on QQAdapter._detect_message_type (it doesn't use self). Revert the test change that was only needed when it was converted to an instance method.
- Reset uv.lock to origin/main; the PR's stale lock also included unrelated changes (atroposlib source URL, hermes-agent version bump, fastapi additions) that don't belong.

Verified E2E:

- Existing user (QQ_HOME_CHANNEL set): gateway + cron both pick up the legacy name; deprecation warning logs once.
- Fresh user (QQBOT_HOME_CHANNEL set): gateway + cron use new name, no warning.
- Both set: new name wins on both surfaces.

Targeted tests: 296 passed, 4 skipped (qqbot + cron + hermes_cli).	2026-04-17 15:31:14 -07:00
WideLee	103beea7a6	fix(qqbot): fix test failures after package refactor

- Re-export _ssrf_redirect_guard from __init__.py
- Fix _parse_json @staticmethod using self._log_tag
- Update test_detect_message_type to call as instance method
- Fix mock.patch path for httpx.AsyncClient in adapter submodule	2026-04-17 15:31:14 -07:00
WideLee	287d3e12c7	chore: add author map	2026-04-17 15:31:14 -07:00
WideLee	6fd58e1e4a	refactor(qqbot): replace log tags with self._log_tag	2026-04-17 15:31:14 -07:00
WideLee	235e6ecc0e	refactor(qqbot): replace hardcoded log tags with self._log_tag and adjust STT log levels

- Remove @staticmethod from _detect_message_type, _convert_silk_to_wav, _convert_raw_to_wav, _convert_ffmpeg_to_wav so they can use self._log_tag
- Replace all remaining hardcoded "QQBot" log args with self._log_tag
- Downgrade STT routine flow logs (download, convert, success) from info to debug
- Keep warning level for actual failures (STT failed, ffmpeg error, empty transcript)	2026-04-17 15:31:14 -07:00
WideLee	1648e41c17	refactor(qqbot): change qrcode style	2026-04-17 15:31:14 -07:00
WideLee	c4cdf3b861	refactor(qqbot): change setup method selection prompt_choice style	2026-04-17 15:31:14 -07:00
WideLee	02f5e3dc27	refactor(qqbot): use _log_tag with app_id in all logger calls for multi-instance disambiguation	2026-04-17 15:31:14 -07:00
WideLee	b7d330211a	fix(qqbot): simplify home channel prompt wording	2026-04-17 15:31:14 -07:00
WideLee	a5f4d652d3	feat(qqbot): prompt to add scanned user to allow list and home channel during setup	2026-04-17 15:31:14 -07:00
WideLee	6358501915	refactor(qqbot): split qqbot.py into package & add QR scan-to-configure onboard flow

- Refactor gateway/platforms/qqbot.py into gateway/platforms/qqbot/ package:
  - adapter.py: core QQAdapter (unchanged logic, constants from shared module)
  - constants.py: shared constants (API URLs, timeouts, message types)
  - crypto.py: AES-256-GCM key generation and secret decryption
  - onboard.py: QR-code scan-to-configure API (create_bind_task, poll_bind_result)
  - utils.py: User-Agent builder, HTTP headers, config helpers
  - __init__.py: re-exports all public symbols for backward compatibility
- Add interactive QR-code setup flow in hermes_cli/gateway.py:
  - Terminal QR rendering via qrcode package (graceful fallback to URL)
  - Auto-refresh on QR expiry (up to 3 times)
  - AES-256-GCM encrypted credential exchange
  - DM security policy selection (pairing/allowlist/open)
- Update hermes_cli/setup.py to delegate to gateway's _setup_qqbot()
- Add qrcode>=7.4 dependency to pyproject.toml and requirements.txt	2026-04-17 15:31:14 -07:00
Teknium	31e7276474	fix(gateway): consolidate per-session cleanup; close SessionDB on shutdown (#11800)

Three closely-related fixes for shutdown / lifecycle hygiene.

1. _release_running_agent_state(session_key) helper
----------------------------------------------------
Per-running-agent state lived in three dicts that drifted out of sync across cleanup sites:

- self._running_agents: AIAgent per session_key
- self._running_agents_ts: start timestamp per session_key
- self._busy_ack_ts: last busy-ack timestamp per session_key

Inventory before this PR: 8 sites: del self._running_agents[key]

- only 1 (stale-eviction) cleaned all three
- 1 cleaned _running_agents + _running_agents_ts only
- 6 cleaned _running_agents only

Each missed entry was a (str, float) tuple per session per gateway lifetime - small, persistent, accumulates across thousands of sessions over months. Per-platform leaks compounded.

This change adds a single helper that pops all three dicts in lockstep, and replaces every bare 'del self._running_agents[key]' site with it.

Per-session state that PERSISTS across turns (_session_model_overrides, _voice_mode, _pending_approvals, _update_prompt_pending) is intentionally NOT touched here; those have their own lifecycles tied to user actions, not turn boundaries.

2. _running_agents_ts cleared in _stop_impl
----------------------------------------
Was being missed alongside _running_agents.clear(); now included.

3. SessionDB close() in _stop_impl
---------------------------------
The SQLite WAL write lock stayed held by the old gateway connection until Python actually exited, causing 'database is locked' errors when --replace launched a new gateway against the same file. We now explicitly close both self._db and self.session_store._db inside _stop_impl, with try/except so a flaky close on one doesn't block the other.

Tests
-----
tests/gateway/test_session_state_cleanup.py - 10 cases covering:

* helper pops all three dicts atomically
* idempotent on missing/empty keys
* preserves other sessions
* tolerates older runners without _busy_ack_ts attribute
* thread-safe under concurrent release
* regression guard: scans gateway/run.py and fails if a future contributor reintroduces 'del self._running_agents[...]' outside docstrings
* SessionDB close called on both holders during shutdown
* shutdown tolerates missing session_store
* shutdown tolerates close() raising on one db (other still closes)

Broader gateway suite: 3108 passed (vs 3100 on baseline) - failure delta is +8 net passes; the 10 remaining failures are pre-existing cross-test pollution / missing optional deps (matrix needs olm, signal/telegram approval flake, dingtalk Mock wiring), all reproduce on stashed baseline.	2026-04-17 15:18:23 -07:00
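
The lockstep cleanup described in commit 31e7276474 can be sketched roughly as follows. This is a minimal illustration, not the code from gateway/run.py: the class name `GatewayRunnerSketch` and the `_state_lock` attribute are hypothetical, and only the three dict names and the helper name come from the commit message.

```python
import threading


class GatewayRunnerSketch:
    """Hypothetical stand-in for the gateway runner (illustration only)."""

    # Dict names taken from the commit message; everything else is assumed.
    _PER_RUN_STATE = ("_running_agents", "_running_agents_ts", "_busy_ack_ts")

    def __init__(self):
        self._running_agents = {}      # session_key -> agent object
        self._running_agents_ts = {}   # session_key -> start timestamp
        self._busy_ack_ts = {}         # session_key -> last busy-ack timestamp
        self._state_lock = threading.Lock()  # assumption: illustrative lock

    def _release_running_agent_state(self, session_key):
        """Drop a session's entry from all three dicts in one place."""
        with self._state_lock:
            for name in self._PER_RUN_STATE:
                # getattr with a default tolerates runners that lack a dict.
                table = getattr(self, name, None)
                if isinstance(table, dict):
                    # pop with a default makes repeated calls harmless.
                    table.pop(session_key, None)


runner = GatewayRunnerSketch()
runner._running_agents["s1"] = object()
runner._running_agents_ts["s1"] = 1.0
runner._busy_ack_ts["s1"] = 2.0
runner._running_agents["s2"] = object()

runner._release_running_agent_state("s1")
runner._release_running_agent_state("s1")  # second call is a no-op
```

Routing every cleanup site through one helper means a newly added per-run dict only has to be registered in a single tuple, which is the drift the commit message describes fixing.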


6158 commits