Commit graph

696 commits

Author SHA1 Message Date
Teknium
091d8e1030
feat(codex-runtime): optional codex app-server runtime for OpenAI/Codex models (#24182)
* feat(codex-runtime): scaffold optional codex app-server runtime

Foundational commit for an opt-in alternate runtime that hands OpenAI/Codex
turns to a 'codex app-server' subprocess instead of Hermes' tool dispatch.
Default behavior is unchanged.

Lands in three pieces:

1. agent/transports/codex_app_server.py — JSON-RPC 2.0 over stdio speaker
   for codex's app-server protocol (codex-rs/app-server). Spawn, init
   handshake, request/response, notification queue, server-initiated
   request queue (for approval round-trips), interrupt-friendly blocking
   reads. Tested against real codex 0.130.0 binary end-to-end during
   development.

2. hermes_cli/runtime_provider.py:
   - Adds 'codex_app_server' to _VALID_API_MODES.
   - Adds _maybe_apply_codex_app_server_runtime() helper, called at the
     end of _resolve_runtime_from_pool_entry(). Inert unless
     'model.openai_runtime: codex_app_server' is set in config.yaml AND
     provider in {openai, openai-codex}. Other providers cannot be
     rerouted (anthropic, openrouter, etc. preserved).

3. tests/agent/transports/test_codex_app_server_runtime.py — 24 tests
   covering api_mode registration, the rewriter helper (default-off,
   case-insensitive, opt-in, non-eligible providers preserved), version
   parser, missing-binary handling, error class. Does NOT require codex
   CLI installed.

This commit is wire-only: the api_mode is recognized but AIAgent does
not yet branch on it. Followup commits add the session adapter, event
projector, approval bridge, transcript projection (so memory/skill
review still works), plugin migration, and slash command.

Existing tests remain green:
- tests/cli/test_cli_provider_resolution.py (29 passed)
- tests/agent/test_credential_pool_routing.py (included above)

* feat(codex-runtime): add codex item projector for memory/skill review

The translator that lets Hermes' self-improvement loop keep working under the
Codex runtime: converts codex 'item/*' notifications into Hermes' standard
{role, content, tool_calls, tool_call_id} message shape that
agent/curator.py already knows how to read.

Item taxonomy (matches codex-rs/app-server-protocol/src/protocol/v2/item.rs):
  - userMessage          → {role: user, content}
  - agentMessage         → {role: assistant, content: text}
  - reasoning            → stashed in next assistant's 'reasoning' field
  - commandExecution     → assistant tool_call(name='exec_command') + tool result
  - fileChange           → assistant tool_call(name='apply_patch') + tool result
  - mcpToolCall          → assistant tool_call(name='mcp.<server>.<tool>') + tool result
  - dynamicToolCall      → assistant tool_call(name=<tool>) + tool result
  - plan/hookPrompt/etc  → opaque assistant note, no fabricated tool_calls

Invariants preserved:
  - Message role alternation never violated: each tool item produces at most
    one assistant + one tool message in that order, correlated by call_id.
  - Streaming deltas (item/<type>/outputDelta, item/agentMessage/delta)
    don't materialize messages — only item/completed does. Mirrors how
    Hermes already only writes the assistant message after streaming ends.
  - Tool call ids are deterministic (codex item id-based) so replays produce
    identical messages and prefix caches stay valid (AGENTS.md pitfall #16).
  - JSON args use sorted_keys for the same reason.

Real wire formats verified against codex 0.130.0 by capturing live
notifications from thread/shellCommand and including one as a fixture
(COMMAND_EXEC_COMPLETED).

23 new tests, all green:
  - Streaming deltas don't materialize (3 paths)
  - Turn/thread frame events are silent
  - commandExecution: 5 tests including non-zero exit annotation +
    deterministic id stability across replays
  - agentMessage + reasoning attachment + reasoning consumption
  - fileChange: summary without inlined content
  - mcpToolCall: namespaced naming + error surfacing
  - userMessage: text fragments only (drops images/etc)
  - opaque items: no fabricated tool_calls
  - Helpers: deterministic id stability + sorted JSON args
  - Role alternation invariant across all four tool-shaped item types

This commit is a pure addition. AIAgent integration (the wire that uses the
projector) is the next commit.

* feat(codex-runtime): add session adapter + approval bridge

The third self-contained module: CodexAppServerSession owns one Codex
thread per Hermes session, drives turn/start, consumes streaming
notifications via CodexEventProjector, handles server-initiated approval
requests, and translates cancellation into turn/interrupt.

The adapter has a single public per-turn method:

    result = session.run_turn(user_input='...', turn_timeout=600)
    # result.final_text          → assistant text for the caller
    # result.projected_messages  → list ready to splice into AIAgent.messages
    # result.tool_iterations     → tick count for _iters_since_skill nudge
    # result.interrupted         → True on Ctrl+C / deadline / interrupt
    # result.error               → error string when the turn cannot complete
    # result.turn_id, thread_id  → for sessions DB / resume

Behavior:

  - ensure_started() spawns codex, does the initialize handshake, and
    issues thread/start with cwd + permissions profile. Idempotent.
  - run_turn() blocks until turn/completed, drains server-initiated
    requests (approvals) before reading notifications so codex never
    deadlocks waiting for us, projects every item/completed via the
    projector, and increments tool_iterations for the skill nudge gate.
  - request_interrupt() is thread-safe (threading.Event); the next loop
    iteration issues turn/interrupt and unwinds.
  - turn_timeout deadlock guard issues turn/interrupt and records an
    error if the turn never completes.
  - close() escalates terminate → kill via the underlying client.

Approval bridge:

  Codex emits server-initiated requests for execCommandApproval and
  applyPatchApproval. The adapter translates Hermes' approval choice
  vocabulary onto codex's decision vocabulary:

    Hermes 'once'                → codex 'approved'
    Hermes 'session' or 'always' → codex 'approvedForSession'
    Hermes 'deny' / anything else → codex 'denied'

  Routing precedence:
    1. _ServerRequestRouting.auto_approve_* flags (cron / non-interactive)
    2. approval_callback wired by the CLI (defers to
       tools.approval.prompt_dangerous_approval())
    3. Fail-closed denial when neither is wired

  Unknown server-request methods are answered with JSON-RPC error -32601
  so codex doesn't hang waiting for us.

Permission profile mapping mirrors AGENTS.md:
    Hermes 'auto'              → codex 'workspace-write'
    Hermes 'approval-required' → codex 'read-only-with-approval'
    Hermes 'unrestricted/yolo' → codex 'full-access'

20 new tests, all green. Combined with prior commits this PR now has
67 tests across three modules:
  - test_codex_app_server_runtime.py: 24 (api_mode + transport surface)
  - test_codex_event_projector.py: 23 (item taxonomy projections)
  - test_codex_app_server_session.py: 20 (turn loop + approvals + interrupts)

Full tests/agent/transports/ directory: 249/249 pass — no regressions
to existing transport tests.

Still no wire into AIAgent.run_conversation(); that integration commit
is small and goes next.

* feat(codex-runtime): wire codex_app_server runtime into AIAgent

The integration commit. AIAgent.run_conversation() now early-returns to a
new helper _run_codex_app_server_turn() when self.api_mode ==
'codex_app_server', bypassing the chat_completions tool loop entirely.

Three small surgical edits to run_agent.py (~105 LOC total):

1. Line ~1204 (constructor api_mode validation set):
   Add 'codex_app_server' so an explicit api_mode='codex_app_server'
   passed to AIAgent() isn't silently rewritten to 'chat_completions'.

2. Line ~12048 (run_conversation, just before the while loop):
   Early-return to _run_codex_app_server_turn() when self.api_mode is
   'codex_app_server'. Placed AFTER all standard pre-loop setup —
   logging context, session DB, surrogate sanitization, _user_turn_count
   and _turns_since_memory increments, _ext_prefetch_cache, memory
   manager on_turn_start — so behavior outside the model-call loop is
   identical between paths. Default Hermes flow is unchanged when the
   flag is off.

3. End-of-class (line ~15497):
   New method _run_codex_app_server_turn(). Lazy-instantiates one
   CodexAppServerSession per AIAgent (reused across turns), runs the
   turn, splices projected_messages into messages, increments
   _iters_since_skill by tool_iterations (since the chat_completions
   loop normally does that per iteration), fires
   _spawn_background_review on the same cadence as the default path.

Counter accounting:

  _turns_since_memory  ← already incremented at run_conversation:11817
                         (gated on memory store configured) — codex
                         helper does NOT touch it (would double-count).
  _user_turn_count     ← already incremented at run_conversation:11793
                         — codex helper does NOT touch it.
  _iters_since_skill   ← incremented in the chat_completions loop per
                         tool iteration. Codex helper increments by
                         turn.tool_iterations since the loop is bypassed.

User message:

  ALREADY appended to messages by run_conversation pre-loop (line 11823)
  before the early-return reaches us. Helper does NOT append again.
  Regression test test_user_message_not_duplicated guards this.

Approval callback wiring:

  Lazy-fetches tools.terminal_tool._get_approval_callback at session
  spawn time, passes to CodexAppServerSession. CLI threads with
  prompt_toolkit get interactive approvals; gateway/cron contexts get
  the codex-side fail-closed deny.

Error path:

  Codex session exceptions become a 'partial' result with completed=False
  and a final_response that explicitly tells the user how to switch back:
  'Codex app-server turn failed: ... Fall back to default runtime with
  /codex-runtime auto.' Same return-dict shape as the chat_completions
  path so all callers (gateway, CLI, batch_runner, ACP) work unchanged.

9 new integration tests in tests/run_agent/test_codex_app_server_integration.py:
  - api_mode='codex_app_server' is accepted on AIAgent construction
  - run_conversation returns the expected codex shape
    (final_response, codex_thread_id, codex_turn_id, completed, partial)
  - Projected messages are spliced into messages list
  - _iters_since_skill ticks per tool iteration
  - _user_turn_count delegated to standard flow (not double-counted)
  - User message appears exactly once (regression guard)
  - _spawn_background_review IS invoked (memory/skill review keeps working)
  - chat.completions.create is NEVER called (loop fully bypassed)
  - Session exception → partial result with /codex-runtime auto hint
  - Interrupted turn → partial result with error preserved

Adjacent test runs confirm no regressions:
  - tests/run_agent/test_memory_nudge_counter_hydration.py: green
  - tests/run_agent/test_background_review.py: green
  - tests/run_agent/test_fallback_model.py: green
  - tests/agent/transports/: 249/249 green

Still missing for full feature: /codex-runtime slash command, plugin
migration helper, docs page, live e2e test gated on codex binary. Those
are the remaining followup commits.

* feat(codex-runtime): add /codex-runtime slash command (CLI + gateway)

User-facing toggle for the optional codex app-server runtime. Follows the
'Adding a Slash Command (All Platforms)' pattern from AGENTS.md exactly:
single CommandDef in the central registry → CLI handler → gateway handler
→ running-agent guard → all surfaces (autocomplete, /help, Telegram menu,
Slack subcommands) update automatically.

Surface:
    /codex-runtime                    — show current state + codex CLI status
    /codex-runtime auto               — Hermes default runtime
    /codex-runtime codex_app_server   — codex subprocess runtime
    /codex-runtime on / off           — synonyms

Files changed:

  hermes_cli/codex_runtime_switch.py (new):
    Pure-Python state machine shared by CLI and gateway. Parse args,
    read/write model.openai_runtime in the config dict, gate enabling
    behind a codex --version check (don't let users opt in to a runtime
    they have no binary for; print npm install hint instead).
    Returns a CodexRuntimeStatus dataclass that callers render however
    suits their surface.

  hermes_cli/commands.py:
    Single CommandDef entry, no aliases (codex-runtime is its own thing).

  cli.py:
    Dispatch in process_command() + _handle_codex_runtime() handler that
    delegates to the shared module and renders results via _cprint.

  gateway/run.py:
    Dispatch in _handle_message() + _handle_codex_runtime_command() that
    returns a string (gateway sends as message). On a successful change
    that requires a new session, _evict_cached_agent() forces the next
    inbound message to construct a fresh AIAgent with the new api_mode —
    avoids prompt-cache invalidation mid-session.

  gateway/run.py running-agent guard:
    /codex-runtime joins /model in the early-intercept block so a runtime
    flip mid-turn can't split a turn across two transports.

Tests:
  tests/hermes_cli/test_codex_runtime_switch.py — 25 tests covering the
  state machine: arg parsing (10 cases incl. case-insensitive and
  synonyms), reading current runtime (5 cases incl. malformed configs),
  writing runtime (3 cases), apply() entry point covering read-only,
  no-op, codex-missing-blocked, codex-present-success, disable-no-binary-check,
  and persist-failure paths (8 cases). All green.

Adjacent test suites confirm no regressions:
  - tests/hermes_cli/test_commands.py + test_codex_runtime_switch.py:
    167/167 green
  - tests/agent/transports/: 283/283 green when combined with prior commits

Still missing: plugin migration helper, docs page, live e2e test gated on
codex binary. Followup commits.

* feat(codex-runtime): auto-migrate Hermes MCP servers to ~/.codex/config.toml

Translates the user's mcp_servers config from ~/.hermes/config.yaml into
the TOML format codex's MCP client expects. Wired into the
/codex-runtime codex_app_server enable path so users get their MCP tool
surface in the spawned subprocess automatically.

The migration runs on every enable. Failures are non-fatal — the runtime
change still proceeds and the user gets a warning so they can fix the
codex config manually.

What translates (mapping verified against codex-rs/core/src/config/edit.rs):
  Hermes mcp_servers.<n>.command/args/env  → codex stdio transport
  Hermes mcp_servers.<n>.url/headers       → codex streamable_http transport
  Hermes mcp_servers.<n>.timeout           → codex tool_timeout_sec
  Hermes mcp_servers.<n>.connect_timeout   → codex startup_timeout_sec
  Hermes mcp_servers.<n>.cwd               → codex stdio cwd
  Hermes mcp_servers.<n>.enabled: false    → codex enabled = false

What does NOT translate (warned + skipped per server):
  Hermes-specific keys (sampling, etc.) — codex's MCP client has no
  equivalent. Listed in the per-server skipped[] field of the report.

What's NOT migrated (intentional):
  AGENTS.md — codex respects this file natively in its cwd. Hermes' own
  AGENTS.md (project-level) is already in the worktree, so codex picks
  it up without translation. No code needed.

Idempotency design:
  All managed content lives between a 'managed by hermes-agent' marker
  and the next non-mcp_servers section header. _strip_existing_managed_block
  removes the prior managed region cleanly, preserving any user-added
  codex config (model, providers.openai, sandbox profiles, etc.) above
  or below.

Files added:
  hermes_cli/codex_runtime_plugin_migration.py — pure-Python migration
    helper. Public API: migrate(hermes_config, codex_home=None,
    dry_run=False) returns MigrationReport with .migrated/.errors/
    .skipped_keys_per_server. No external TOML dependency — minimal
    formatter handles strings/numbers/booleans/lists/inline-tables.

  tests/hermes_cli/test_codex_runtime_plugin_migration.py — 39 tests
  covering:
    - per-server translation (12): stdio/http/sse, cwd, timeouts,
      enabled flag, command+url precedence, sampling drop, unknown keys
    - TOML formatter (8): types, escaping, inline tables, error case
    - existing-block stripping (4): no marker, alone, with user content
      above, with user content below
    - end-to-end migrate() (8): empty, dry-run, round-trip, idempotent
      re-run, preserves user config, error reporting, invalid input,
      summary formatting

Files changed:
  hermes_cli/codex_runtime_switch.py — apply() now calls migrate() in
    the codex_app_server enable branch. Migration failure logs a warning
    in the result message but does NOT fail the runtime change. Disable
    path (auto) explicitly skips migration.

  tests/hermes_cli/test_codex_runtime_switch.py — 3 new tests:
    test_enable_triggers_mcp_migration, test_disable_does_not_trigger_migration,
    test_migration_failure_does_not_block_enable.

All 325 feature tests green:
  - tests/agent/transports/: 249 (incl. 67 new)
  - tests/run_agent/test_codex_app_server_integration.py: 9
  - tests/hermes_cli/test_codex_runtime_switch.py: 28 (3 new)
  - tests/hermes_cli/test_codex_runtime_plugin_migration.py: 39 (new)

* perf(codex-runtime): cache codex --version check within apply()

Single /codex-runtime invocation could spawn 'codex --version' up to 3
times (state report, enable gate, success message). Each spawn is ~50ms,
so the cumulative cost wasn't a crisis, but it was wasteful and turned a
trivial slash command into something noticeably laggy on slower systems.

Refactored to lazy-once via a closure over a nonlocal cache. First call
spawns; subsequent calls in the same apply() reuse the result.

Behavior unchanged — same return shape, same error handling, same install
hint when codex is missing. Just one subprocess per call instead of three.

Two regression-guard tests added:
  - test_binary_check_cached_within_apply: enable path → call_count == 1
  - test_binary_check_cached_on_read_only_call: state-report path → call_count == 1

Total tests for /codex-runtime now 30 (was 28); all 143 codex-runtime
tests still green.

* fix(codex-runtime): correct protocol field names found via live e2e test

Three real bugs caught only by running a turn end-to-end against codex
0.130.0 with a real ChatGPT subscription. Unit tests passed because they
asserted on our own (incorrect) wire shapes; the wire format from
codex-rs/app-server-protocol/src/protocol/v2/* is the source of truth and
my initial reading of the README was incomplete.

Bug 1: thread/start.permissions wire format

Was sending {"profileId": "workspace-write"}.
Real format per PermissionProfileSelectionParams enum (tagged union):
  {"type": "profile", "id": "workspace-write"}
AND requires the experimentalApi capability declared during initialize.
AND requires a matching [permissions] table in ~/.codex/config.toml or
codex fails the request with 'default_permissions requires a [permissions]
table'.

Fix: stop overriding permissions on thread/start. Codex picks its default
profile (read-only unless user configures otherwise), which matches what
codex CLI users expect — they configure their default permission profile
in ~/.codex/config.toml the standard way. Trying to be clever about
profile selection broke every turn we tested.

Live error before fix: 'Invalid request: missing field type' on every
turn/start, even though our turn/start payload was correct — the field
codex was complaining about was inside the permissions sub-object we
shouldn't have been sending.

Bug 2: server-request method names

Was matching 'execCommandApproval' and 'applyPatchApproval'.
Real names per common.rs ServerRequest enum:
  item/commandExecution/requestApproval
  item/fileChange/requestApproval
  item/permissions/requestApproval (new third method)

Fix: match the documented names. Added handler for
item/permissions/requestApproval that always declines — codex sometimes
asks to escalate permissions mid-turn and silent acceptance would surprise
users.

Live symptom before fix: agent.log showed
'Unknown codex server request: item/commandExecution/requestApproval'
and codex stalled because we replied with -32601 (unsupported method)
instead of an approval decision. The agent reported back 'The write
command was rejected' even though Hermes never showed the user an
approval prompt.

Bug 3: approval decision values

Was sending decision strings 'approved'/'approvedForSession'/'denied'.
Real values per CommandExecutionApprovalDecision enum (camelCase):
  accept, acceptForSession, decline, cancel
(also AcceptWithExecpolicyAmendment and ApplyNetworkPolicyAmendment
variants we don't currently use).

Fix: rename _approval_choice_to_codex_decision return values; update
auto_approve_* fallbacks; update fail-closed default from 'denied' to
'decline'. Test mapping table updated to match.

Live test verified after fixes:
  $ hermes (with model.openai_runtime: codex_app_server)
  > Run the shell command: echo hermes-codex-livetest > .../proof.txt
    then read it back

  Approval prompt fired with 'Codex requests exec in <cwd>'.
  User chose 'Allow once'. Codex executed the command, wrote the file,
  read it back. Final response: 'Read back from proof.txt:
  hermes-codex-livetest'. File contents on disk match.

agent.log confirms:
  codex app-server thread started: id=019e200e profile=workspace-write
                                    cwd=/tmp/hermes-codex-livetest/workspace

All 20 session tests still green after wire-format updates.

* fix(codex-runtime): correct apply_patch approval params + ship docs

Live e2e revealed FileChangeRequestApprovalParams doesn't carry the
changeset (just itemId, threadId, turnId, reason, grantRoot) — Codex's
'reason' field describes what the patch wants to do. Test config and
display logic updated to use it. The first 'apply_patch (0 change(s))'
display from the live test is now 'apply_patch: <reason>'.

Adds website/docs/user-guide/features/codex-app-server-runtime.md
covering enable/disable, prerequisites, approval UX, MCP migration
behavior, permission profile delegation to ~/.codex/config.toml, known
limitations, and the architecture diagram. Wired into the Automation
category in sidebars.ts.

Live e2e validation across the path matrix:
  ✓ thread/start handshake
  ✓ turn/start with text input
  ✓ commandExecution items + projection
  ✓ item/commandExecution/requestApproval → Hermes UI → response
  ✓ Approve once → command runs
  ✓ Deny → command rejected, codex falls back to read-only message
  ✓ Multi-turn (codex remembers prior turn's results)
  ✓ apply_patch via Codex's fileChange path
  ✓ item/fileChange/requestApproval → Hermes UI
  ✓ MCP server migration loads inside spawned codex (verified via
    'use the filesystem MCP tool' prompt)
  ✓ /codex-runtime auto → codex_app_server toggle cycle
  ✓ Disable doesn't trigger migration
  ✓ Enable with codex CLI present succeeds + migrates
  ✓ Hermes-side interrupt path (turn/interrupt request issued cleanly
    even if codex finishes before the interrupt lands)

Known live-validated limitations now documented in the docs page:
  - delegate_task subagents unavailable on this runtime
  - permission profile selection delegated to ~/.codex/config.toml
  - apply_patch approval prompt has no inline changeset (codex protocol
    doesn't expose it)

145/145 codex-runtime tests still green.

* feat(codex-runtime): native plugin migration + UX polish (quirks 2/4/5/10/11)

Major: migrate native Codex plugins (#7 in OpenClaw's PR list)

Discovers installed curated plugins via codex's plugin/list RPC and
writes [plugins."<name>@<marketplace>"] entries to ~/.codex/config.toml
so they're enabled in the spawned Codex sessions. This is the
'YouTube-video-worthy' bit Pash highlighted: when a user has
google-calendar, github, etc. installed in their Codex CLI, those
plugins activate automatically when they enable Hermes' codex runtime.

Implementation:
  - hermes_cli/codex_runtime_plugin_migration.py: new _query_codex_plugins()
    helper spawns 'codex app-server' briefly and walks plugin/list. Returns
    (plugins, error) — failures are non-fatal so MCP migration still works.
  - render_codex_toml_section() now takes plugins + permissions args.
  - migrate() defaults: discover_plugins=True, default_permission_profile=
    'workspace-write'. Explicit None on either disables that side.
  - _strip_existing_managed_block() now also strips [plugins.*] and
    [permissions]/[permissions.*] sections inside the managed block, so
    re-runs replace plugins cleanly without touching codex's own config.

Quirk fixes:

#2 Default permissions profile written on enable.
   Without this, Codex's read-only default kicks in and EVERY write
   triggers an approval prompt. Now writes [permissions] default =
   'workspace-write' so the runtime feels normal out of the box. Set
   default_permission_profile=None to opt out.

#4 apply_patch approval prompt now shows what's changing.
   Codex's FileChangeRequestApprovalParams doesn't carry the changeset.
   Session adapter now caches the fileChange item from item/started
   notifications and looks it up by itemId when codex requests approval.
   Prompt shows '1 add, 1 update: /tmp/new.py, /tmp/old.py' instead of
   'apply_patch (0 change(s))'.

   Side benefit: also drains pending notifications BEFORE handling a
   server request, so the projector and per-turn caches are up to date
   when the approval decision fires. Bounded to 8 notifications per
   loop iter to avoid starving codex's response.

#5/#10 Exec approval prompt never shows empty cwd.
   When codex omits cwd in CommandExecutionRequestApprovalParams, fall
   back to the session's cwd. If somehow neither is available, show
   '<unknown>' explicitly instead of an empty string.

   Also surfaces 'reason' from the approval params when codex provides
   it — gives users more context on why codex wants to run something.

#11 Banner indicates the codex_app_server runtime when active.
   New 'Runtime: codex app-server (terminal/file ops/MCP run inside
   codex)' line appears in the welcome banner only when the runtime is
   on. Default banner is unchanged.

Tests:
  - 7 new tests in test_codex_runtime_plugin_migration.py covering
    plugin discovery (mocked), failure handling, dry-run skip, opt-out
    flag, idempotent re-runs, and permissions writing.
  - 3 new tests in test_codex_app_server_session.py covering the
    enriched approval prompts: cwd fallback, change summary on
    apply_patch, fallback when no item/started cache exists.
  - All 26 session tests + 46 migration tests green; 153 total in PR.

* feat(codex-runtime): hermes-tools MCP callback + native plugin migration

The big architectural addition: when codex_app_server runtime is on,
Hermes registers its own tool surface as an MCP server in
~/.codex/config.toml so the codex subprocess can call back into Hermes
for tools codex doesn't ship with — web_search, browser_*, vision,
image_generate, skills, TTS.

Also: 'migrate native codex plugins' (Pash's YouTube-video-worthy bit) —
when the user has plugins like Linear, GitHub, Gmail, Calendar, Canva
installed via 'codex plugin', Hermes discovers them via plugin/list and
writes [plugins.<name>@openai-curated] entries so they activate
automatically.

New module: agent/transports/hermes_tools_mcp_server.py
  FastMCP stdio server exposing 17 Hermes tools. Each call dispatches
  through model_tools.handle_function_call() — same code path as the
  Hermes default runtime. Run with:
    python -m agent.transports.hermes_tools_mcp_server [--verbose]

  Exposed: web_search, web_extract, browser_navigate / _click / _type /
    _press / _snapshot / _scroll / _back / _get_images / _console /
    _vision, vision_analyze, image_generate, skill_view, skills_list,
    text_to_speech.

  NOT exposed (deliberately):
    - terminal/shell/read_file/write_file/patch — codex has built-ins
    - delegate_task/memory/session_search/todo — _AGENT_LOOP_TOOLS in
      model_tools.py:493, require running AIAgent context. Documented
      as a limitation and surfaced in the slash command output.

Migration changes (hermes_cli/codex_runtime_plugin_migration.py):
  - _query_codex_plugins() spawns 'codex app-server' briefly to walk
    plugin/list and pull installed openai-curated plugins. Failures are
    non-fatal — MCP migration still completes.
  - render_codex_toml_section() now takes plugins + permissions args
    AND wraps the managed block with a MIGRATION_END_MARKER comment so
    the stripper can reliably find both ends, even when the block
    contains top-level keys (default_permissions = ...).
  - migrate() defaults: discover_plugins=True, expose_hermes_tools=True,
    default_permission_profile=':workspace' (built-in codex profile name
    — must be prefixed with ':'). All three opt-out via explicit args.
  - _build_hermes_tools_mcp_entry() builds the codex stdio entry with
    HERMES_HOME and PYTHONPATH passthrough so a worktree-launched
    Hermes points the MCP subprocess at the same module layout.

Live-caught wire bugs fixed during this turn:
  1. Permission profile config key is top-level , NOT a [permissions] table. The [permissions] table is
     for *user-defined* profiles with structured fields. Built-in
     profile names start with ':' (':workspace', ':read-only',
     ':danger-no-sandbox'). Was emitting
     which codex rejected with 'invalid type: string "X", expected
     struct PermissionProfileToml'.
  2. Built-in profile is , NOT . Codex
     rejected  with 'unknown built-in profile'.
  3. Codex's MCP layer sends  for
     tool-call confirmation. We weren't handling it, so codex stalled
     and returned 'MCP tool call was rejected'. Now: auto-accept for
     our own hermes-tools server (user already opted in by enabling
     the runtime), decline for third-party servers.

Quirk fixes shipped (from the limitations list):
  #2 default permissions: workspace profile written on enable. No more
     approval prompt on every write.
  #4 apply_patch approval shows what's changing: cache fileChange
     items from item/started, look up by itemId when codex sends
     item/fileChange/requestApproval. Prompt: '1 add, 1 update:
     /tmp/new.py, /tmp/old.py' instead of '0 change(s)'.
  #5/#10 exec approval cwd never empty: fall back to session cwd, then
     '<unknown>'. Also surfaces 'reason' from codex when present.
  #11 banner shows 'Runtime: codex app-server' line when active so
     users understand why tool counts may not match what's reachable.

Tests:
  - 5 new tests in test_codex_runtime_plugin_migration.py covering
    plugin discovery, expose_hermes_tools entry generation, idempotent
    re-runs, opt-out flag, permissions profile.
  - 3 new tests in test_codex_app_server_session.py covering enriched
    approval prompts (cwd fallback, fileChange summary).
  - 2 new tests for mcpServer/elicitation/request handling (accept
    hermes-tools, decline others).
  - New test file test_hermes_tools_mcp_server.py covering module
    surface, EXPOSED_TOOLS safety invariants (no shell/file_ops,
    no agent-loop tools), and main() error paths.
  - 166 codex-runtime tests total, all green.

Live e2e validated against codex 0.130.0 + ChatGPT subscription:
  ✓ /codex-runtime codex_app_server enables, migrates filesystem MCP,
    registers hermes-tools, writes default_permissions = ':workspace'
  ✓ Banner shows 'Runtime: codex app-server' line in subsequent sessions
  ✓ Shell command runs without approval prompt (workspace profile works)
  ✓ Multi-turn — codex remembers prior turn's results
  ✓ apply_patch path via fileChange request approval
  ✓ web_search via hermes-tools MCP callback returns real Firecrawl
    results: 'OpenAI Codex CLI – Getting Started' end-to-end in 13s
  ✓ Disable cycle clean

Docs updated: website/docs/user-guide/features/codex-app-server-runtime.md
  Full re-write covering native plugin migration, the hermes-tools
  callback architecture, the prerequisites change ('codex login is
  separate from hermes auth login codex'), the trade-off table now
  reflecting which Hermes tools work via callback, and the limitations
  list updated with what's actually unavailable on this runtime.

* feat(codex-runtime): pin user-config preservation invariant for quirk #6

Quirk #6 from the limitations list — user MCP servers / overrides /
codex-only sections in ~/.codex/config.toml that live OUTSIDE the
hermes-managed block must survive re-migration verbatim.

This already worked thanks to the MIGRATION_MARKER + MIGRATION_END_MARKER
pair I added when fixing the default_permissions wire format (so the
strip can find both ends of the managed region even with top-level
keys like default_permissions). But it was an emergent property
without a test pinning it.

Now explicitly tested:
  - User MCP server above the managed block survives migration
  - User MCP server below the managed block survives migration
  - Both above + below survive a second re-migration
  - User content (model, providers, sandbox, otel, etc.) outside our
    region is left untouched

Docs added a section "Editing ~/.codex/config.toml safely" explaining
the marker contract — so users know they can add their own MCP
servers, override permissions, configure codex-only options, etc.
without fear of Hermes overwriting their work.

167 codex-runtime tests, all green.

* docs(codex-runtime): clarify the actual tool surface — shell covers terminal/read/write/find

Previous docs and PR description undersold what codex's built-in
toolset actually provides. apply_patch alone made it sound like the
runtime could only edit files in patch format — implying you'd lose
terminal use, read_file, write_file, search/find. That was wrong.

Codex's 'shell' tool runs arbitrary shell commands inside the sandbox,
which covers everything you'd do in bash: cat/head/tail (read), echo>
or heredocs (write), find/rg/grep (search), ls/cd (navigate), build/
test/git/etc. apply_patch is for structured multi-file edits on top
of that. update_plan is its in-runtime todo. view_image loads images.
And codex has its own web_search built in (in addition to the
Firecrawl-backed one Hermes exposes via MCP callback).

Docs now have a 'What tools the model actually has' section right
after Why, breaking the surface into three clearly-labeled buckets:

  1. Codex's built-in toolset (always on) — shell, apply_patch,
     update_plan, view_image, web_search; covers everything terminal-
     adjacent.
  2. Native Codex plugins (auto-migrated from your codex plugin
     install) — Linear, GitHub, Gmail, Calendar, Outlook, Canva, etc.
  3. Hermes tool callback (MCP server in ~/.codex/config.toml) —
     web_search/web_extract via Firecrawl, browser_*, vision_analyze,
     image_generate, skill_view/skills_list, text_to_speech.

Plus a 'What's NOT available' callout listing the four agent-loop tools
(delegate_task, memory, session_search, todo) that need running
AIAgent context and can't reach the codex runtime.

Trade-offs table broken out: shell, apply_patch, update_plan,
view_image, sandbox each get their own row with a one-line description
so users can see at a glance what's available natively.

Architecture diagram updated to list the codex built-ins by name
instead of 'apply_patch + shell + sandbox'.

No code changes — purely docs clarification. 167 codex-runtime tests
still green.

* fix(codex-runtime): _spawn_background_review signature + review fork api_mode downgrade

Two real bugs in the self-improvement loop integration that the previous
test mocked away.

Bug 1: wrong call signature

The codex helper was calling self._spawn_background_review() with no
args after every turn. That function actually requires:
  messages_snapshot=list   (positional or keyword)
  review_memory=bool       (at least one trigger must be True)
  review_skills=bool

So the call would have raised TypeError at runtime — except the only
test that exercised this path mocked _spawn_background_review entirely
and just asserted spawn.called, so the wrong-arg shape never surfaced.

Bug 2: review fork inherits codex_app_server api_mode

The review fork is constructed with:
  api_mode = _parent_runtime.get('api_mode')

So when the parent is codex_app_server, the review fork ALSO runs as
codex_app_server. But the review fork's whole job is to call agent-loop
tools (memory, skill_manage) which require Hermes' own dispatch — they
short-circuit with 'must be handled by the agent loop' on the codex
runtime. So the review fork would have run, decided to save something,
called memory or skill_manage, and silently no-op'd.

Fixed in run_agent.py:_spawn_background_review() — when the parent
api_mode is 'codex_app_server', the review fork is downgraded to
'codex_responses' (same OAuth credentials, same openai-codex provider,
but talks to OpenAI's Responses API directly so Hermes owns the loop).

Also rewrote the codex helper's review wiring to match the
chat_completions path:
  - Computes _should_review_memory in the pre-loop block (was already
    being computed; now passed through to the helper as an arg).
  - Computes _should_review_skills AFTER the codex turn returns +
    counters tick (line ~15432 pattern in chat_completions).
  - Calls _spawn_background_review(messages_snapshot=, review_memory=,
    review_skills=) only when at least one trigger fires.
  - Adds the external memory provider sync (_sync_external_memory_for_turn)
    that the chat_completions path runs after every turn.

Tests:

  Replaced the broken test_background_review_invoked (which only
  asserted spawn.called) with three sharper tests:
    - test_background_review_NOT_invoked_below_threshold:
      single turn at default thresholds → no review fires (would have
      caught the original 'every turn calls spawn with no args' bug)
    - test_background_review_skill_trigger_fires_above_threshold:
      10 tool_iterations at threshold=10 → review fires with
      messages_snapshot=list, review_skills=True, counter resets
    - test_background_review_signature_never_breaks: regression guard
      asserting positional args are always empty and kwargs include
      messages_snapshot

  New TestReviewForkApiModeDowngrade class:
    - test_codex_app_server_parent_downgrades_review_fork: drives the
      real _spawn_background_review function (no mock at that level),
      asserts the review_agent gets api_mode='codex_responses' when
      the parent was codex_app_server.

Live-validated against real run_conversation:
  - Counter ticked from 0 to 5 after a 5-tool-iteration turn
  - _spawn_background_review fired exactly once with kwargs-only signature
  - review_skills=True, review_memory=False
  - messages_snapshot was 12 entries (5 assistant tool_calls + 5 tool
    results + 1 final assistant + initial system/user)
  - Counter reset to 0 after fire

170 codex-runtime tests, all green.

Docs: added a Self-improvement loop section to the codex runtime page
explaining both how the trigger logic stays equivalent and that the
review fork is auto-downgraded to codex_responses for the agent-loop
tools. Also clarified that apply_patch and update_plan ARE codex's
built-in tools (the previous version made it sound like they were
separate from 'codex's stuff' — they're not, all five tools listed
in 'What tools the model actually has' section 1 are codex built-ins).

* feat(codex-runtime): expose kanban tools through Hermes MCP callback

Kanban workers spawn as separate hermes chat -q subprocesses that read
the user's config.yaml. If model.openai_runtime: codex_app_server is set
globally (which is the whole point of opt-in), every dispatched worker
ALSO comes up on the codex runtime.

That mostly works — codex's built-in shell + apply_patch + update_plan
do the actual task work fine — but it had one critical break: the
worker handoff tools (kanban_complete, kanban_block, kanban_comment,
kanban_heartbeat) are Hermes-registered tools, not codex built-ins.
On the codex runtime, codex builds its own tool list and these never
reach the model, so the worker would do the work but not be able to
report back, hanging until the dispatcher's timeout escalates it as
zombie.

Fix: add all 9 kanban tools to the EXPOSED_TOOLS list in the Hermes
MCP callback. They dispatch statelessly through handle_function_call()
just like web_search and the others — they read HERMES_KANBAN_TASK
from env (set by the dispatcher), gate correctly (worker tools require
the env var, orchestrator tools require it unset), and write to
~/.hermes/kanban.db.

Why kanban tools work via stateless dispatch when delegate_task/memory/
session_search/todo don't: those four are listed in _AGENT_LOOP_TOOLS
(model_tools.py:493) and short-circuit in handle_function_call() with
'must be handled by the agent loop' — they need to mutate AIAgent's
mid-loop state. Kanban tools have no such requirement; they're pure
side-effect functions against the kanban.db plus state_meta.

Tools exposed:
  Worker handoff (require HERMES_KANBAN_TASK):
    kanban_complete, kanban_block, kanban_comment, kanban_heartbeat
  Read-only board queries:
    kanban_show, kanban_list
  Orchestrator (require HERMES_KANBAN_TASK unset):
    kanban_create, kanban_unblock, kanban_link

Tests:
  - test_kanban_worker_tools_exposed: complete/block/comment/heartbeat
    in EXPOSED_TOOLS (regression guard for the would-hang-worker bug)
  - test_kanban_orchestrator_tools_exposed: create/show/list/unblock/link

Docs:
  - New 'Workflow features' section in the docs page covering /goal,
    kanban, and cron behavior on this runtime
  - /goal: works fully via run_conversation feedback; only caveat is
    approval-prompt noise on long writes-heavy goals (mitigated by
    the default :workspace permission profile)
  - Kanban: enumerated which tools are reachable via the callback and
    why the env var propagates correctly through the codex subprocess
    to the MCP server subprocess
  - Cron: documented as 'not specifically tested' — same rules as the
    CLI apply since cron runs through AIAgent.run_conversation
  - Trade-offs table gained rows for /goal, kanban worker, kanban
    orchestrator

172/172 codex-runtime tests green (+2 from kanban tests).

* docs(codex-runtime): wire /codex-runtime into slash-commands ref + flag aux token cost

Three docs gaps caught during a final audit:

1. /codex-runtime was only in the feature docs page, not in the
   slash-commands reference. Added rows to both the CLI section and
   the Messaging section so users discover it where they'd look for
   slash command syntax.

2. CODEX_HOME and HERMES_KANBAN_TASK weren't in environment-variables.md.
   CODEX_HOME lets users redirect Codex CLI's config dir (the migration
   honors it). HERMES_KANBAN_TASK is set by the kanban dispatcher and
   propagates to the codex subprocess + the hermes-tools MCP subprocess
   so kanban worker tools gate correctly — documented as 'don't set
   manually' since it's an internal handoff.

3. Aux client behavior on this runtime. When openai_runtime=
   codex_app_server is on with the openai-codex provider, every aux
   task (title generation, context compression, vision auto-detect,
   session search summarization, the background self-improvement review
   fork) flows through the user's ChatGPT subscription by default.

   This is true for the existing codex_responses path too, but it's
   more visible / important here because users explicitly opted in for
   subscription billing. Added a 'Auxiliary tasks and ChatGPT
   subscription token cost' section to the docs page with a YAML
   example showing how to override specific aux tasks to a cheaper
   model (typically google/gemini-3-flash-preview via OpenRouter).

   Also documents how the self-improvement review fork gets
   auto-downgraded from codex_app_server to codex_responses by the
   fix earlier in this PR.

No code changes — pure docs. 172 codex-runtime tests still green.

* docs+test(codex-runtime): pin HOME passthrough, document multi-profile + CODEX_HOME

OpenClaw hit a real footgun in openclaw/openclaw#81562: when spawning
codex app-server they were synthesizing a per-agent HOME alongside
CODEX_HOME. That made every subprocess codex's shell tool launches
(gh, git, aws, npm, gcloud, ...) see a fake $HOME and miss the user's
real config files. They had to back it out in PR #81562 — keep
CODEX_HOME isolation, leave HOME alone.

Audit confirms Hermes' codex spawn doesn't have this problem. We do
os.environ.copy() and only overlay CODEX_HOME (when provided) and
RUST_LOG. HOME passes through unchanged. But it was an emergent
property without a test pinning it, so adding a regression guard:

  test_spawn_env_preserves_HOME — confirms parent HOME survives intact
                                  in the subprocess env
  test_spawn_env_sets_CODEX_HOME_when_provided — confirms codex_home
                                                  arg still isolates
                                                  codex state correctly

Docs additions:

  'HOME environment variable passthrough' section — calls out the
  contract explicitly: CODEX_HOME isolates codex's own state, HOME
  stays user-real so gh/git/aws/npm/etc. find their normal config.
  Cites openclaw#81562 as the cautionary tale.

  'Multi-profile / multi-tenant setups' section — addresses the
  related concern: profiles share ~/.codex/ by default. For users who
  want per-profile codex isolation (separate auth, separate plugins),
  documents the manual CODEX_HOME=<profile-scoped-dir> approach.

  Explains why we DON'T auto-scope CODEX_HOME per profile: doing so
  would silently invalidate existing codex login state for anyone
  upgrading to this PR with tokens already at ~/.codex/auth.json.
  Opt-in is safer than surprising users.

174 codex-runtime tests (+2 from HOME guards), all green.

* fix(codex-runtime): TOML control-char escapes + atomic config.toml write

Two footguns caught in a final audit pass before merge.

Bug 1: TOML control characters not escaped

The _format_toml_value() helper escaped backslashes and double quotes
but passed literal control characters (\n, \t, \r, \f, \b) through
unchanged. TOML basic strings don't allow literal control characters
— a path or env var containing a newline would produce invalid TOML
that codex refuses to load.

Realistic exposure: pathological cases like a HERMES_HOME with a
trailing newline (env var concatenation accident), or a PYTHONPATH
with a tab from a multi-line shell heredoc.

Fix: escape all five TOML basic-string control sequences (\b \t \n
\f \r) in addition to \\ and \" that we already did. Order
matters — backslash must come first or the other escapes get
re-escaped.

Bug 2: config.toml write wasn't atomic

If the python process crashed between target.mkdir() and the
write_text() finishing, a half-written config.toml could be left
behind. On NFS / Windows / some FUSE mounts this is a real concern;
on ext4/APFS small writes are usually atomic in practice but not
guaranteed.

Fix: write to a tempfile.mkstemp() temp file in the same directory,
then Path.replace() (atomic same-dir rename on POSIX, ReplaceFile on
Windows). On rename failure, clean up the temp file so repeated
failed migrations don't pile up .config.toml.* files.

Tests:
  - test_string_with_newline_escaped — \n in value → \n in output
  - test_string_with_tab_escaped — \t in value → \t in output
  - test_string_with_other_controls_escaped — \r, \f, \b
  - test_windows_path_escaped_correctly — backslash doubling
  - test_atomic_write_no_temp_leak_on_success — no .config.toml.*
    left over after a successful write
  - test_atomic_write_cleanup_on_rename_failure — temp file removed
    when Path.replace raises (simulated disk full)

180 codex-runtime tests, all green (+6 from this commit).

Footguns audited but NOT fixed (with rationale):

- Concurrent migrations race. Two Hermes processes hitting
  /codex-runtime codex_app_server within seconds of each other could
  cause one writer to lose entries. Low probability (you'd have to
  enable from two surfaces simultaneously) and low impact (just re-run
  migration). Adding fcntl/msvcrt locking is more code than it's
  worth here. The atomic rename above means each individual write is
  consistent — only the merge step is racy.

- Codex protocol version drift. We pin MIN_CODEX_VERSION=0.125 and
  check at runtime but don't reject too-new versions. Right call —
  the protocol has been stable through 0.125 → 0.130. If OpenAI
  breaks it later we'd see the error in test_codex_app_server_runtime
  on CI before users hit it.
2026-05-13 17:18:15 -07:00
vominh1919
e2b2d48610 fix(cli): preserve startup banner on terminal resize
Recover from SIGWINCH without clearing the physical screen or scrollback
buffer. The startup banner and tool summary are printed before
prompt_toolkit owns the live chrome, so they live in normal terminal
scrollback. Calling erase_screen() + \x1b[3J] on every resize removed
that UI permanently — _replay_output_history cannot reconstruct it
because the banner was never added to _OUTPUT_HISTORY.

Instead, just reset prompt_toolkit's renderer cache and invalidate so
the next incremental redraw starts from a clean slate, then let the
original on_resize handler recalculate layout for the new terminal
size. This matches the behaviour of bash/zsh/fish on SIGWINCH.

Fixes NousResearch/hermes-agent#22999
2026-05-13 13:36:31 -07:00
iuyup
d6c9711ba8 fix(security): reduce unnecessary shell=True in subprocess calls
- memory_setup.py: use shlex.split() for plugin dep checks instead of shell=True
- transcription_tools.py: avoid shell=True for auto-detected whisper commands
  (user-provided templates via env var still use shell=True for compatibility)
- cli.py: add comment clarifying intentional shell=True for user quick_commands
- Add test verifying auto-detected template is shlex-safe

Addresses CONTRIBUTING.md Priority #3 (Security hardening — shell injection).
2026-05-13 10:31:22 -07:00
NorethSea
7a4ad5ccb4 fix(cli): use display-width for response box header label to support CJK
Replace `len(label)` with `HermesCLI._status_bar_display_width(label)`
in two places where the response box top border is rendered.

`len()` counts characters, not terminal columns. CJK characters like
`测` and `试` each occupy 2 columns, causing the top border
`╭─ 测试 ───╮` to render 2 columns wider than the bottom border
`╰─────────╯`.

The `_status_bar_display_width` helper already exists (line 2881) and
uses `prompt_toolkit.utils.get_cwidth` for proper CJK width calculation.
2026-05-12 16:44:03 -07:00
helix4u
a34998ee2f fix(cli): parse positional insights days 2026-05-12 14:56:47 -07:00
Teknium
c1eb2dcda7
feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback (#24220)
* feat(security): supply-chain advisory checker + lazy-install framework + tiered install fallback

Three coordinated mitigations for the Mini Shai-Hulud worm hitting
mistralai 2.4.6 on PyPI (2026-05-12) and for the next single-package
compromise that follows.

# What this PR makes true

1. Users with the poisoned mistralai 2.4.6 in their venv get a loud
   detection banner with copy-pasteable remediation steps the moment
   they run hermes (and on every gateway startup).
2. One quarantined / yanked PyPI package can no longer silently demote
   a fresh install to 'core only' — the installer keeps every other
   extra and tells the user which tier landed.
3. Future opt-in backends (Mistral, ElevenLabs, Honcho, etc.) can
   lazy-install on first use under a strict allowlist, instead of
   eagerly pulling everything at install time.

# Detection: hermes_cli/security_advisories.py

- ADVISORIES catalog (one entry currently: shai-hulud-2026-05 for
  mistralai==2.4.6). Adding the next one is a single dataclass.
- detect_compromised() uses importlib.metadata.version() — no pip
  dependency, works in uv venvs that lack pip.
- Banner cache (~/.hermes/cache/advisory_banner_seen) rate-limits
  the startup banner to once per 24h per advisory.
- Acks persisted to security.acked_advisories in config.yaml; never
  re-banner after ack.
- Wired into:
  * hermes doctor — runs first, prints full remediation block
  * hermes doctor --ack <id> — dismisses an advisory
  * cli.py interactive run() and single-query branches — short
    stderr banner pointing at hermes doctor
  * gateway/run.py startup — operator-visible warning in gateway.log

# Lazy-install framework: tools/lazy_deps.py

- LAZY_DEPS allowlist maps namespaced feature keys (tts.elevenlabs,
  memory.honcho, provider.bedrock, etc.) to pip specs.
- ensure(feature) installs missing deps in the active venv via the
  uv → pip → ensurepip ladder (matches tools_config._pip_install).
- Strict spec safety regex rejects URLs, file paths, shell metas,
  pip flag injection, control chars — only PyPI-by-name accepted.
- Gated on security.allow_lazy_installs (default true) plus the
  HERMES_DISABLE_LAZY_INSTALLS env var for restricted/audited envs.
- Migrated three backends as proof of pattern:
  * tools/tts_tool.py — _import_elevenlabs() calls ensure first
  * plugins/memory/honcho/client.py — get_honcho_client lazy-installs
  * tts.mistral / stt.mistral entries pre-registered for when PyPI
    restores mistralai

# Installer fallback tiers

scripts/install.sh, scripts/install.ps1, setup-hermes.sh:

- Centralised _BROKEN_EXTRAS list (currently: mistral). Edit one
  array when a transitive breaks; users keep every other extra.
- New 'all minus known-broken' tier between [all] and the existing
  PyPI-only-extras tier. Only kicks in when [all] fails resolve.
- All three tiers explicit: every fallback announces which tier
  landed and prints a re-run hint when not on Tier 1.
- install.ps1 and install.sh both regenerate their tier specs from
  the same _BROKEN_EXTRAS array so updates stay in sync.

Side effect: install.ps1 Tier 2 spec previously hardcoded 'mistral'
in its extra list — bug fixed by the refactor (mistral is filtered
out).

# Config

hermes_cli/config.py — DEFAULT_CONFIG.security gains:
- acked_advisories: []  (advisory IDs the user has dismissed)
- allow_lazy_installs: True  (security gate for ensure())

No config version bump needed — both keys nest under existing
security: block, and load_config's deep-merge picks up DEFAULT_CONFIG
defaults for users with older configs.

# Tests

tests/hermes_cli/test_security_advisories.py — 23 tests covering:
- detect_compromised matches/non-matches, wildcard frozenset
- ack persistence, idempotence, blank rejection, config-failure path
- banner cache rate limiting + 24h re-banner + ack-stops-banner
- short_banner_lines / full_remediation_text / render_doctor_section /
  gateway_log_message
- shipped catalog well-formedness invariant

tests/tools/test_lazy_deps.py — 40 tests covering:
- spec safety: 11 safe parametrized + 18 unsafe parametrized
- allowlist: unknown-feature rejection, namespace.name shape,
  every shipped spec passes the safety regex
- security gating: config flag, env var, default, fail-open
- ensure() happy/sad paths: already-satisfied, install success,
  pip stderr surfaced on failure, install-succeeds-but-still-missing
- is_available, feature_install_command

Combined: 63 new tests, all passing under scripts/run_tests.sh.

# Validation

- scripts/run_tests.sh tests/hermes_cli/test_security_advisories.py
  tests/tools/test_lazy_deps.py → 63/63 passing
- scripts/run_tests.sh tests/hermes_cli/test_doctor.py
  tests/hermes_cli/test_doctor_command_install.py
  tests/tools/test_tts_mistral.py tests/tools/test_transcription_tools.py
  tests/tools/test_transcription_dotenv_fallback.py → 165/165 passing
- scripts/run_tests.sh tests/hermes_cli/ tests/tools/ →
  9191 passed, 8 pre-existing failures (verified on origin/main
  before this change)
- bash -n on install.sh and setup-hermes.sh → OK
- py_compile on all modified .py files → OK
- End-to-end smoke test of detect_compromised + render_doctor_section
  + gateway_log_message with mocked installed version → produces
  copy-pasteable remediation output

# Community

Full advisory + remediation steps:
website/docs/community/security-advisories/shai-hulud-mistralai-2026-05.md

Short-form post drafts (Discord, GitHub pinned issue, README banner):
scripts/community-announcement-shai-hulud.md

Refs: PR #24205 (mistral disabled), Socket Security advisory
<https://socket.dev/blog/mini-shai-hulud-worm-pypi>

* build(deps): pin every direct dep to ==X.Y.Z (no ranges)

Companion to the supply-chain advisory work: replace every >=/</~= range
in pyproject.toml's [project.dependencies] and [project.optional-dependencies]
with an exact ==X.Y.Z pin sourced from uv.lock.

Why: ranges allow PyPI to ship a fresh version of any direct dep at any
time without a code review on our side. With ranges, the malicious
mistralai 2.4.6 release would have been pulled by every fresh
'pip install -e .[all]' for the hours between upload and PyPI's
quarantine — exactly the install window we got hit on. Exact pins close
that window: the only way a new package version reaches a user is via
an intentional update on our end.

What the user-facing change is: nothing, behavior-wise. Every package
resolves to the same version it was already resolving to via uv.lock —
the pins just remove the resolver's freedom to pick a different one.

Cost: any user installing Hermes alongside another package that requires
a newer pin gets a resolver conflict. Acceptable for our isolated-venv
install path; documented in the new comment block.

Build-system requires line (setuptools>=61.0) is intentionally left
as a range — pinning the build backend would block fresh pip from
bootstrapping the build on architectures where that exact wheel isn't
available.

mistral extra (mistralai==2.3.0) is pinned but stays out of [all]
(per PR #24205). 'uv lock' regeneration will fail until PyPI restores
mistralai; lockfile regeneration is gated behind that, NOT on every PR.

LAZY_DEPS in tools/lazy_deps.py also moved to exact pins so the lazy-
install pathway can never resolve a different version than the one
declared in pyproject.toml.

Validation:

- Cross-checked all 77 pinned direct deps in pyproject.toml against
  uv.lock — every pin matches the resolved version exactly.
- Cross-checked all LAZY_DEPS specs against uv.lock — same.
- 'uv pip install -e .[all] --dry-run' resolves 205 packages cleanly.
- tests/tools/test_lazy_deps.py + tests/hermes_cli/test_security_advisories.py
  → 63/63 passing (every shipped spec passes the safety regex).
- Doctor + TTS + transcription targeted suite → 146/146 passing.

* build(deps): hash-verify transitives via uv.lock; remove unresolvable [mistral] extra

You asked: 'what about the dependencies the dependencies rely on?' —
correctly noting that exact-pinning direct deps in pyproject.toml does
NOT cover the transitive graph. `pip install` and `uv pip install` both
re-resolve transitives fresh from PyPI at install time, so a compromised
transitive (e.g. `httpcore` if it got worm-poisoned tomorrow) would
still hit our users even with every direct dep exact-pinned.

# What this commit fixes

1. **Both real installer scripts now prefer `uv sync --locked` as Tier 0.**
   uv.lock records SHA256 hashes for every transitive — a compromised
   package with a different hash gets REJECTED. Falls through to the
   existing `uv pip install` cascade if the lockfile is missing or
   stale, with a loud warning that the fallback path does NOT
   hash-verify transitives. Previously only `setup-hermes.sh` (the dev
   path) used the lockfile; `scripts/install.sh` and `scripts/install.ps1`
   (the paths fresh users actually run) skipped it.

2. **Removed the `[mistral]` extra entirely.** The `mistralai` PyPI
   project is fully quarantined right now — every version returns 404,
   so any pin we wrote was unresolvable, which broke `uv lock --check`
   in CI. Restoration is documented in pyproject.toml as a 5-step
   checklist (verify, re-add extra, re-enable in 4 modules, regenerate
   lock, optionally re-add to [all]).

3. **Regenerated uv.lock.** 262 packages, mistralai/eval-type-backport/
   jsonpath-python pruned. `uv lock --check` now passes.

# Defense-in-depth view

| Layer                      | Where             | Protects against                          |
|----------------------------|-------------------|-------------------------------------------|
| Exact pins in pyproject    | direct deps       | new mistralai 2.4.6-style direct compromise |
| uv.lock + `--locked` install | transitive graph  | transitive worm injection                  |
| Tier-0 hash-verified path  | install.sh / .ps1 | actually USE the lockfile in fresh installs |
| `uv lock --check` CI gate  | every PR          | drift between pyproject and lockfile      |
| `hermes_cli/security_advisories.py` | runtime  | cleanup for users who already got hit      |

The exact pinning + hash verification together close the supply-chain
gap. Without the lockfile path, exact pins alone are theater.

# Validation

- `uv lock --check` → passes (262 packages resolved, no drift).
- `bash -n` on install.sh + setup-hermes.sh → OK.
- 209/209 tests passing across new + adjacent test files
  (test_lazy_deps.py, test_security_advisories.py, test_doctor.py,
  test_tts_mistral.py, test_transcription_tools.py).
- TOML parse OK.

* chore: remove community announcement drafts (PR body covers it)

* build(deps): lazy-install every opt-in backend (anthropic, search, terminal, platforms, dashboard)

Extends the lazy-install framework to cover everything that's not used by
every hermes session. Base install drops from ~60 packages to 45.

Moved out of core dependencies = []:
- anthropic   (only when provider=anthropic native, not via aggregators)
- exa-py, firecrawl-py, parallel-web (search backends; only when picked)
- fal-client  (image gen; only when picked)
- edge-tts    (default TTS but still optional)

New extras in pyproject.toml: [anthropic] [exa] [firecrawl] [parallel-web]
[fal] [edge-tts]. All added to [all].

New LAZY_DEPS entries: provider.anthropic, search.{exa,firecrawl,parallel},
tts.edge, image.fal, memory.hindsight, platform.{telegram,discord,matrix},
terminal.{modal,daytona,vercel}, tool.dashboard.

Each import site now calls ensure() before importing the SDK. Where the
module had a top-level try/except (telegram, discord, fastapi), the
graceful-fallback pattern was extended to lazy-install on first
check_*_requirements() call and re-bind module globals.

Updated test_windows_native_support.py tzdata check from snapshot
(>=2023.3 literal) to invariant (any version + win32 marker).

Validation:
- Base install: 45 packages (was ~60); 6 newly-extracted packages absent
- uv lock --check: passes (262 packages, no drift)
- 209/209 lazy_deps + advisory + doctor + tts/transcription tests passing
- py_compile clean on all 12 modified modules
2026-05-12 01:02:25 -07:00
Teknium
ea1d0462cf
fix(cli): vertical fallback for markdown tables wider than terminal (#23948)
Follow-up to #23863 (CJK table alignment). The realigner was
correctly padding pipes to identical column offsets, but when a
table's natural width exceeds terminal cells it produced lines that
the terminal soft-wrapped mid-cell, destroying column alignment
visually even though the bytes were perfectly padded. Reported as
'columns are not aligned' on tables containing one long row alongside
several short rows.

Approach mirrors Claude Code's MarkdownTable.tsx narrow-terminal
fallback: when realign_markdown_tables is given an available_width
budget and the rebuilt horizontal table exceeds it, render each body
row as 'Header: value' lines separated by a thin ─ rule. Word-wraps
oversize values at the budget with a 2-space continuation indent.

- agent/markdown_tables.py: realign_markdown_tables(text, available_width=None);
  threshold check at the top of _render_block flips into a new
  _render_vertical fallback. Includes _wrap_to_width with hard-break
  for tokens longer than the budget.
- cli.py: helper _terminal_width_for_streaming() returns
  shutil.get_terminal_size().columns minus _STREAM_PAD and a 2-cell
  safety margin; passed to all three realign call sites
  (_render_final_assistant_content for strip+render Panel paths, and
  the streaming flushers in _emit_stream_text / _flush_stream).
- tests/agent/test_markdown_tables.py: 4 new tests covering the
  overflow-vertical fallback for ASCII + CJK content, the
  'fits → keep horizontal' case, and the long-cell wrap with indent.

Live-verified: with COLUMNS=100, the user's reported 'long row in
ASCII table' case now renders as vertical key-value rows that all fit
the panel; the 6-column CJK comparison table still renders as an
aligned horizontal table because it fits inside 100 cols.
2026-05-11 16:49:13 -07:00
kshitij
2ec8d2b42f
chore: ruff auto-fix PLR6201 — tuple → set in membership tests (#23937)
Replace  with  for all literal-tuple
membership tests. Set lookup is O(1) vs O(n) for tuple — consistent
micro-optimization across the codebase.

608 instances fixed via `ruff --fix --unsafe-fixes`, 0 remaining.
133 files, +626/-626 (net zero).
2026-05-11 11:13:25 -07:00
Teknium
1d00716754
fix(cli,tui): align CJK / wide-char markdown tables (#23863)
CJK and emoji glyphs render as two terminal cells but JS String#length
and the model's own padding count them as one, so any markdown table
with Chinese / Japanese / Korean cells drifts right per row when a
real terminal renders it. Both surfaces fix this with a display-cell
width measurement (wcswidth on the Python side, stringWidth on the
TUI side).

Changes:
- agent/markdown_tables.py: new helper. realign_markdown_tables(text)
  detects markdown table blocks (header + |---| divider) and
  rewrites the row padding using wcwidth.wcswidth so every pipe and
  dash lines up across rows. No-op on text without tables.
- cli.py: hook the helper into _render_final_assistant_content for
  strip / render modes (raw passes through untouched), and into the
  streaming line emitter so live token-by-token rendering also
  produces aligned tables. A small two-buffer state machine in
  _emit_stream_text holds table rows until the block ends, then
  flushes them through the realigner so all rows pad to a single
  per-column width.
- ui-tui/src/components/markdown.tsx: renderTable now uses
  stringWidth (Bun.stringWidth fast path + East-Asian-width-aware
  fallback, already memoised in @hermes/ink) instead of UTF-16
  String#length for both column-width measurement and per-cell
  padding. Drops the comment that documented the bug as a deliberate
  limitation.

Validation:
- New tests/agent/test_markdown_tables.py (11): every rebuilt block
  shares pipe column offsets across rows for pure CJK, mixed
  CJK+emoji, ragged-row, and multi-table inputs.
- Updated tests/cli/test_cli_markdown_rendering.py: the existing
  strip-mode test asserted exact whitespace; rewritten to assert the
  alignment contract (cell content survives + every rendered row
  shares pipe offsets).
- New ui-tui markdown.test.ts case (1): rendered column-2 start
  offset is identical for the header + every body row, including
  the CJK row that drifted before the fix.
- Live: hermes chat -q with the user-reported screenshot prompt now
  produces a perfectly aligned table on the wire (header, divider,
  4 body rows including '通义千问', all pipes at identical columns).
2026-05-11 11:13:06 -07:00
kshitij
657874460f
chore: ruff auto-fixes — collapsible-else-if, if-stmt-min-max, dict.fromkeys (#23926)
PLR5501 (collapsible-else-if): 28 instances — else: if: → elif:
PLR1730 (if-stmt-min-max):   15 instances — if x<y: x=y → x=max(x,y)
C420   (dict.fromkeys):       2 instances — dictcomp → dict.fromkeys
PLR1704 (redefined-argument): 1 instance — reason → err_msg (shadow fix)
C414   (unnecessary-list):    1 instance — sorted(list(x)) → sorted(x)

28 files, -44 net lines. All mechanical, zero logic changes.
17,211 tests pass, zero regressions.
2026-05-11 11:03:29 -07:00
Teknium1
cc9e788c14 fix(cli): defensive _slash_confirm_state access + AUTHOR_MAP
- getattr(self, '_slash_confirm_state', None) at the two read sites that
  trip object.__new__(HermesCLI) test fixtures (test_cli_external_editor,
  test_cli_skin_integration)
- _build_tui_layout_children: make slash_confirm_widget keyword-only with
  default None to avoid breaking subclassing extension hook for wrapper
  CLIs (test_cli_extension_hooks)
- AUTHOR_MAP entry for zhengyn0001

Follow-up to the salvaged commit ca1d4375a.
2026-05-11 10:02:03 -07:00
zhengyuna
054f568578 fix: use TUI modal for slash confirmations 2026-05-11 10:02:03 -07:00
Teknium
3e7145e0bb
revert: roll back /goal checklist + /subgoal feature stack (#23813)
* Revert "fix(goals): force judge to use tool calls instead of JSON-text replies (#23547)"

This reverts commit a63a2b7c78.

* Revert "fix(goals): forward standing /goal state on auto-compression session rotation (#23530)"

This reverts commit 4a080b1d5a.

* Revert "feat(goals): /goal checklist + /subgoal user controls (#23456)"

This reverts commit 404640a2b7.
2026-05-11 07:06:27 -07:00
Mibayy
ebf2ea584a feat(terminal,cli): docker_extra_args + display.timestamps
Two independent opt-in QoL toggles, both off by default.

terminal.docker_extra_args:
- List of extra flags appended verbatim to docker run after security
  defaults. Useful for adding capabilities (e.g. --cap-add SETUID) or
  other docker run options not exposed by existing config keys.
- Non-string entries are logged and skipped.
- Also available via TERMINAL_DOCKER_EXTRA_ARGS='[...]' env var.

display.timestamps:
- Appends [HH:MM] to user input bullet and the assistant response box
  header. Single hub in _format_submitted_user_message_preview()
  covers both single-line and multi-line user previews; assistant
  response label gets the timestamp at box-open time.

Closes #1569 (timestamps).

Co-authored-by: Mibayy <Mibayy@users.noreply.github.com>
2026-05-10 22:43:39 -07:00
Teknium
404640a2b7
feat(goals): /goal checklist + /subgoal user controls (#23456)
* feat(goals): /goal checklist + /subgoal user controls

Two-phase judge for /goal — Phase A decomposes the goal into a detailed
checklist on first turn; Phase B evaluates each pending item harshly
against the agent's most recent response. The goal completes only when
every item is in a terminal status (completed or impossible). Adds
/subgoal so the user can append, complete, mark impossible, undo,
remove, or clear items the judge missed or got wrong.

Mechanics:
- GoalState gains `checklist` and `decomposed` fields, both backwards
  compatible (old state_meta rows load unchanged).
- Phase A: aux call writes a harsh, exhaustive checklist; biased toward
  more items not fewer. Falls through to legacy freeform judge when
  decompose fails.
- Phase B: judge gets the checklist + last-response snippet + path to
  a per-session conversation dump at <HERMES_HOME>/goals/<sid>.json.
  A bounded read_file tool (max 5 calls per turn, restricted to that
  one file) lets the judge inspect history when the snippet is
  ambiguous. Stickiness in code: terminal items are frozen, only the
  user can revert via /subgoal undo.
- Continuation prompt shows checklist progress when non-empty;
  reverts to old prompt when empty.
- Status line shows M/N done counts.

CLI + gateway + TUI gateway all pass the agent reference into
evaluate_after_turn so the dump can be written. Gateway-side
/subgoal is allowed mid-run since it only modifies the checklist
the judge consults at turn boundaries.

Tests: 24 new cases — backcompat round-trip, Phase A decompose,
Phase B updates + new_items + stickiness, user override flows,
conversation dump (incl. unsafe-sid sanitization), judge read_file
restriction. Existing freeform-mode tests updated to patch the
renamed `judge_goal_freeform` and skip Phase A explicitly.

* fix(goals): off-by-one in judge index, message-list plumbing, prompt tuning

Three live-test findings from running /goal end-to-end against
gemini-3-flash-preview as the judge:

1. Off-by-one bug — the judge sees the checklist rendered with 1-based
   indices ('1. [ ] foo, 2. [ ] bar') but the apply layer indexed
   state.checklist as 0-based. Result: every judge update landed on
   the wrong item, evidence got attached to neighbouring rows, and
   the genuine 'first pending' item (usually #1) never got marked.
   Fix: convert 1 → 0 in _parse_evaluate_response. Also tightened the
   user prompt to call out the 1-based scheme explicitly. New tests
   cover the parser conversion + an end-to-end fake-judge round-trip.

2. Conversation dump never happened — _extract_agent_messages tried
   common AIAgent attribute names (.messages, .conversation_history,
   etc.) but AIAgent doesn't expose the message list as an instance
   attribute; it lives inside run_conversation()'s scope. Result: the
   judge's read_file tool always saw history_path=unavailable. Fix:
   added an explicit messages= kwarg to evaluate_after_turn that all
   three call sites (CLI, gateway, TUI gateway) now pass directly.
   Agent-attribute extraction kept as back-compat fallback.

3. Prompt was too harsh on simple goals. The original 'be HARSH,
   default to leaving items pending' wording made the judge refuse
   to mark 'file exists' completed even after the agent ran ls,
   test -f, os.path.isfile, and find — burning the entire 8-turn
   budget on a fizzbuzz task. Softened to 'strict but not absurd'
   with explicit guidance on what counts as evidence and a directive
   not to require re-proving items already established earlier.

Re-tested live with the same fizzbuzz goal: now terminates in 2
turns with all 8 checklist items correctly attributed to their
own evidence. /subgoal user-action flow (add / complete / undo /
impossible) verified live as well.
2026-05-10 16:56:51 -07:00
Teknium
c5f1f863ac
fix(cli): drive _prompt_text_input directly when off main thread (#23454)
Slash commands (/clear, /new, /undo, /reload-mcp) are dispatched from the
process_loop daemon thread.  prompt_toolkit.run_in_terminal returns a
coroutine that only the main-thread event loop can drive, so calling it
from a daemon thread orphans the coroutine — the input prompt never
renders and user keystrokes leak into the composer instead of the
confirmation prompt (issue #23185).

Mirror the thread-aware guard already in _run_curses_picker: when off the
main thread, fall back to a direct input() call.  Also wrap
run_in_terminal in try/except so WSL / Warp / other emulators that
silently drop the scheduled coroutine fall back to input() too.

Tests: tests/cli/test_prompt_text_input_thread_safety.py covers main
thread (run_in_terminal path), daemon thread (direct input fallback),
no-app, run_in_terminal-raises, and EOF handling.
2026-05-10 16:16:10 -07:00
teknium1
00ce5f04d9 feat(session): make /handoff actually transfer the session live
Builds on @kshitijk4poor's CLI handoff stub. The original PR's flow
deferred everything to whenever a real user happened to message the
target platform; this rewrites it so the gateway picks up handoffs
immediately and the destination chat just starts working.

State machine on sessions table replaces the boolean flag:
  None -> 'pending' -> 'running' -> ('completed' | 'failed')
plus handoff_error for failure reasons. CLI request_handoff /
get_handoff_state / list_pending_handoffs / claim_handoff /
complete_handoff / fail_handoff helpers wrap the transitions.

CLI side (cli.py): /handoff <platform> validates the platform's home
channel via load_gateway_config, refuses if the agent is mid-turn,
flips the row to 'pending', and poll-blocks (60s) on terminal state.
On 'completed' it prints the /resume hint and exits the CLI like
/quit. On 'failed' or timeout it surfaces the reason and the CLI
session stays intact.

Gateway side (gateway/run.py): new _handoff_watcher background task
scans state.db every 2s, atomically claims pending rows, and runs
_process_handoff for each. _process_handoff:

  1. Resolves the platform's home channel.
  2. Asks the adapter for a fresh thread via the new
     create_handoff_thread(parent_chat_id, name) capability so the
     handed-off conversation gets its own scrollback. Adapters that
     don't support threads (or fail) return None and the watcher
     falls back to the home channel directly.
  3. Constructs a SessionSource keyed as 'thread' when a thread was
     created, 'dm' otherwise, then session_store.switch_session
     re-binds the destination key to the CLI session_id. The full
     role-aware transcript replays via load_transcript on the next
     turn (no flat-text injection into context_prompt).
  4. Forges a synthetic MessageEvent(internal=True) with the handoff
     notice and dispatches through _handle_message; the agent runs
     against the loaded transcript and adapter.send delivers the
     reply.
  5. Marks the row 'completed' on success, 'failed' (+error) on any
     exception.

Adapter capability (gateway/platforms/base.py): create_handoff_thread
default returns None. Three overrides:

  - Telegram (gateway/platforms/telegram.py): wraps _create_dm_topic
    so DM topics (Bot API 9.4+) and forum supergroups both work.
  - Discord (gateway/platforms/discord.py): parent.create_thread on
    text channels with a seed-message + message.create_thread
    fallback for permission edge cases. Skips DMs and other
    non-thread-capable parents.
  - Slack (gateway/platforms/slack.py): posts a seed message and
    returns its ts as the thread anchor — Slack threads are
    message-anchored.

In thread mode, build_session_key keys the destination without
user_id (thread_sessions_per_user defaults to False) so the synthetic
turn and any later real-user message in the thread share the same
session_key — seamless takeover without race.

CommandDef stays cli_only=True (handoff is initiated from the CLI;
gateway exposes /resume for the reverse direction).

Removed the original PR's _handle_message_with_agent handoff hook
(transcript-as-text injection into context_prompt) and the
send_message_tool notification — both replaced by the watcher path.

Tests rewritten around the new state machine: 13/13 pass.
E2E-validated thread + no-thread paths and the failure path against
real worktree imports with mocked adapters.
2026-05-10 13:06:25 -07:00
kshitijk4poor
878611a79d feat(session): add /handoff command for cross-platform session transfer
Adds /handoff <platform> CLI command that queues the current session for
resume on the configured home channel of any messaging platform.

CLI side:
- /handoff telegram — marks session in shared DB, sends summary to
  the Telegram home channel via send_message
- /handoff discord — same for Discord
- Supports telegram, discord, slack, whatsapp, signal, matrix

Gateway side:
- On new session creation, checks for pending handoffs for the
  incoming message's platform
- If found, loads the CLI session's full conversation history and
  injects it into the context prompt as a handoff transcript
- Agent continues the conversation seamlessly

Files:
- hermes_state.py: handoff_pending, handoff_platform columns + helpers
- cli.py: _handle_handoff_command dispatch + handler
- hermes_cli/commands.py: CommandDef entry
- gateway/run.py: handoff detection in _handle_message_with_agent
- tests/hermes_cli/test_session_handoff.py: 8 tests
2026-05-10 13:06:25 -07:00
Teknium
68e44642c8
fix(stream-retry): collapse two-line drop status, name provider, and let agent.log capture diagnostics (#22993)
Subagent stream drops were spamming the parent terminal with two lines
per blip ('Connection dropped...' + 'Reconnected...') while leaving zero
breadcrumb in agent.log to debug them.

Two underlying bugs, fixed together:

1. quiet_mode raised the run_agent/tools/etc. loggers to ERROR, which
   filters records before root-logger file handlers see them. The comment
   claimed 'File handlers still capture everything' — that was wrong.
   Removed in both run_agent.py and cli.py; console quietness already
   comes from hermes_logging not installing a console StreamHandler in
   non-verbose mode.

2. The stream-retry blocks emitted two _emit_status calls per drop
   ('⚠️ Connection dropped... Reconnecting...' + '🔄 Reconnected —
   resuming…') with no provider name, so multi-provider sessions had to
   dig through agent.log to attribute a drop. Replaced both call sites
   with a single _emit_stream_drop helper that emits ONE line naming the
   provider and error class, and always writes a structured WARNING to
   agent.log with subagent_id, depth, provider, base_url, error_type.

Net UX change: 6 lines per triple-subagent drop → 3 lines, each
naming the provider. agent.log now has a structured breadcrumb per
retry that didn't exist before.

Tests: 6 new tests in tests/run_agent/test_stream_drop_logging.py
covering the logger-level guard, structured WARNING content, single
status line per drop (no Reconnected follow-up), and provider naming.
2026-05-09 22:35:35 -07:00
Teknium
b67ea7ff47
perf(cli): skip welcome banner on chat -q single-query mode (#22904)
`hermes chat -q "..."` printed the full welcome banner before
running the query — kawaii ASCII logo, available toolsets list,
available skills list, model name, session ID, working directory,
update-available notice. Building it took ~420 ms on cold start
(~200 ms version-update probe, the rest is toolset / skill enumeration
plus Rich panel rendering).

For a one-shot `-q` query the banner is noise: the user already
picked the prompt, doesn't need a toolset reference, and gets the
session ID + resume hint from `_print_exit_summary()` after the
response prints.

The fully-quiet `-Q` / `--quiet` machine-readable path was already
banner-free; this brings the human-facing single-query path in line
so all non-interactive invocations are fast.

Measured impact (`hermes chat -q "ok" --max-turns 1`, 10-run
percentiles, 9950X3D):
  median:  1.90 → 1.75 s  (-150 ms)
  min:     1.80 → 1.73 s  ( -70 ms)
  P25:     1.82 → 1.74 s  ( -80 ms)

Wider variance than expected; the banner cost overlaps with API
latency on real `chat -q` runs. Min-time delta of 70 ms is the
cleanest signal — that's the deterministic banner-build cost gone.
The 150 ms median delta picks up cases where the version-update
probe also finishes during the wait.

Interactive mode (`hermes` with no `-q`) and the `--list-tools` /
`--list-toolsets` one-shot listing commands still show the banner —
those are the contexts where it's actually wanted.

Tests: 656/656 `tests/cli/` pass on top of latest main (modulo 5 pre-
existing flakes in `test_cli_save_config_value.py` that fail with
`No module named 'ruamel'` both with and without this change).
2026-05-09 18:20:28 -07:00
ming
85383c6363 fix(cli): preserve config comments on setting writes 2026-05-09 17:55:12 -07:00
Wesley Simplicio
116a1446a4 fix(terminal): bridge docker_env config to TERMINAL_DOCKER_ENV
Problem: terminal.docker_env set in config.yaml was silently ignored.
Docker containers never received the user-specified env vars.

Root cause: docker_env was missing from all three config→env bridging
maps (cli.py env_mappings, gateway/run.py _terminal_env_map,
hermes_cli/config.py _config_to_env_sync) and from the terminal_tool
_get_env_config() reader. _create_environment() consumed the key from
container_config correctly, but it was always {} because TERMINAL_DOCKER_ENV
was never set.

Also extend the list-serialisation branches in cli.py and gateway/run.py
to handle dict values via json.dumps (lists already used json.dumps;
plain str() on a dict produces undecodable output).

Fix:
- cli.py: add "docker_env": "TERMINAL_DOCKER_ENV" to env_mappings;
  serialise dict values with json.dumps alongside existing list path
- gateway/run.py: same additions to _terminal_env_map and serialisation
- hermes_cli/config.py: add "terminal.docker_env": "TERMINAL_DOCKER_ENV"
  to _config_to_env_sync so `hermes config set terminal.docker_env …`
  persists to .env correctly
- tools/terminal_tool.py: add docker_env key to _get_env_config() reading
  TERMINAL_DOCKER_ENV via _parse_env_var with default "{}"

Tests: add test_docker_env_is_bridged_everywhere to
tests/tools/test_terminal_config_env_sync.py — stash-verified: fails on
origin/main, passes with fix.

Fixes #20537
2026-05-09 17:53:35 -07:00
Teknium
c7f0aab949
feat(openrouter): wire Pareto Code router with min_coding_score knob (#22838)
Pick openrouter/pareto-code as your model and OpenRouter auto-routes each
request to the cheapest model meeting your coding-quality bar (ranked by
Artificial Analysis). The new openrouter.min_coding_score config key (0.0-1.0,
default 0.65) tunes the floor.

- hermes_cli/models.py: add openrouter/pareto-code to OPENROUTER_MODELS so
  it shows up in the picker with a description
- hermes_cli/config.py: add openrouter.min_coding_score (default 0.65 — lands
  on a mid-tier coder on the current Pareto frontier)
- plugins/model-providers/openrouter: emit extra_body.plugins =
  [{id: pareto-router, min_coding_score: X}] when model is openrouter/pareto-code
  AND the score is a valid float in [0.0, 1.0]
- agent/transports/chat_completions.py: same emission on the legacy flag
  path (when no provider profile is loaded)
- run_agent.py: openrouter_min_coding_score kwarg + storage; plumbed into
  both build_kwargs() invocations and the context-summary extra_body path
- cli.py: read openrouter.min_coding_score once at init, validate float in
  [0,1], pass to AIAgent constructions (CLI + background-task paths)
- cron/scheduler.py, batch_runner.py, tools/delegate_tool.py,
  tui_gateway/server.py: propagate the kwarg (mirrors providers_order
  plumbing — subagents inherit, cron/batch read from config)
- tests: profile-level + transport-level coverage of the model gating,
  unset/empty/out-of-range handling, and the legacy flag path
- docs: new 'OpenRouter Pareto Code Router' section in providers.md

Verified end-to-end against api.openrouter.ai: at score=0.65 we land on a
mid-tier coder, at omission we get the strongest. Score is silently dropped
on any model other than openrouter/pareto-code, so it's safe to leave set.
2026-05-09 14:47:00 -07:00
Teknium
70bc52e408
fix(cli): make Ctrl+Enter insert newline on WSL/SSH/Windows Terminal (#22777)
Native Windows, WSL, SSH sessions, and Windows Terminal all send
Ctrl+Enter as bare LF (c-j). Hermes was binding c-j as submit on
every POSIX platform, so Ctrl+Enter submitted instead of inserting
a newline on those terminals. Reported in #22379.

Add _preserve_ctrl_enter_newline() predicate that detects the
environments where Ctrl+Enter must produce a newline (sys.platform
== 'win32', SSH_CONNECTION/SSH_CLIENT/SSH_TTY env, WT_SESSION,
WSL_DISTRO_NAME, /proc/version 'microsoft' marker). Gate the
c-j-as-submit binding off in those environments and gate the
c-j-as-newline handler on. Local POSIX TTYs without those markers
(docker exec, plain ssh from a Mac) keep c-j as submit so plain
Enter still works on thin PTYs.

Add install_ctrl_enter_alias() in hermes_cli/pt_input_extras.py
mapping the three CSI-u / modifyOtherKeys variants of Ctrl+Enter
('\x1b[13;5u', '\x1b[27;5;13~', '\x1b[27;5;13u') to the
(Escape, ControlM) tuple Alt+Enter produces. This lets Kitty /
mintty / xterm-with-modifyOtherKeys users over SSH get a Ctrl+Enter
newline through the existing Alt+Enter handler.

9 new tests + extended existing test_lf_enter_binds_to_submit_handler_posix
to cover bare-local vs SSH branches.

Closes #22379.
2026-05-09 12:48:14 -07:00
Teknium
b9c001116e
feat: confirm prompt for destructive slash commands (#4069) (#22687)
/clear, /new, /reset, and /undo now ask the user to confirm before
discarding conversation state — three-option prompt routed through the
existing tools.slash_confirm primitive.

Native yes/no buttons render on Telegram, Discord, and Slack (their
adapters already implement send_slash_confirm); other platforms get a
text-fallback prompt and reply with /approve, /always, or /cancel.

The classic prompt_toolkit CLI uses the same three-option flow via the
established _prompt_text_input pattern (see _confirm_and_reload_mcp).
TUI keeps its existing modal overlay (#12312).

Gated by new config key approvals.destructive_slash_confirm (default
true). Picking 'Always Approve' flips the gate to false so subsequent
destructive commands run silently — matches the established
mcp_reload_confirm UX.

Out of scope: /cron remove (separate domain — scheduled jobs, not
session history). Existing TUI overlay env-var (HERMES_TUI_NO_CONFIRM)
left unchanged; cosmetic unification can come later.

Closes #4069.
2026-05-09 11:04:46 -07:00
kshitij
2a7047c2ed
fix(sqlite): fall back to journal_mode=DELETE on NFS/SMB/FUSE (#22043)
SQLite's WAL mode requires shared-memory (mmap) coordination and fcntl
byte-range locks that don't reliably work on network filesystems. Upstream
documents this explicitly:
  https://www.sqlite.org/wal.html#sometimes_queries_return_sqlite_busy_in_wal_mode

On NFS / SMB / some FUSE mounts / WSL1, 'PRAGMA journal_mode=WAL' raises
'sqlite3.OperationalError: locking protocol' (SQLITE_PROTOCOL). Before
this change, every feature backed by state.db or kanban.db broke silently:
  - /resume, /title, /history, /branch returned 'Session database not
    available.' with no cause
  - gateway logged the init failure at DEBUG (invisible in errors.log)
  - kanban dispatcher crashed every 60s, driving the known migration race
    (duplicate column name: consecutive_failures, #21708 / #21374)

Changes:
  - hermes_state.apply_wal_with_fallback(): shared helper that tries WAL
    and falls back to DELETE on SQLITE_PROTOCOL-style errors with one
    WARNING explaining why
  - hermes_state.get_last_init_error() + format_session_db_unavailable():
    capture the init failure cause and surface it in user-facing strings
    (with an NFS/SMB pointer for 'locking protocol')
  - hermes_cli/kanban_db.connect(): use the shared helper
  - gateway/run.py: bump SessionDB init failure log DEBUG -> WARNING
    (matches cli.py's existing correct behavior)
  - cli.py (4 sites) + gateway/run.py (5 sites): replace bare
    'Session database not available.' with format_session_db_unavailable()

Tests: 12 new tests in tests/test_hermes_state_wal_fallback.py + 1 new
test in tests/hermes_cli/test_kanban_db.py. Existing suites (state,
kanban, gateway, cli) remain green for all tests unrelated to pre-existing
failures on main.

Evidence: real-world user on NFSv3 mount (172.26.224.200:d2dfac12/home,
local_lock=none) reporting 'Session database not available.' on /resume;
'locking protocol' appears in 4 distinct log entries across backup,
kanban, TUI, and CLI paths in the same session.

closes #22032
2026-05-09 02:09:35 -07:00
Syed Abdur Rehman Ali
f5b635f6ab feat(cli): recognise Shift+Enter as a newline key
Closes #5346.

Most terminals send the same byte sequence for `Enter` and `Shift+Enter`
by default, so the application can't tell them apart — this is a terminal
protocol limitation, not something Hermes can paper over. But terminals
that implement the Kitty keyboard protocol (Kitty / foot / WezTerm /
Ghostty by default; iTerm2 / Alacritty / VS Code terminal / Warp once the
protocol is enabled) DO emit a distinct sequence for `Shift+Enter`:

  - `\x1b[13;2u`     — Kitty / CSI-u, modifier=2
  - `\x1b[27;2;13~`  — xterm modifyOtherKeys=2

Stock prompt_toolkit doesn't have the CSI-u sequence in its
`ANSI_SEQUENCES` table at all, and it maps the modifyOtherKeys variant to
plain `Keys.ControlM` (Enter) — i.e. it strips the Shift modifier, which
is the bug users actually hit on iTerm2 and friends.

This PR adds `hermes_cli/pt_input_extras.install_shift_enter_alias()`,
called once at CLI startup from `cli.py`, which inserts/overwrites those
sequences in `ANSI_SEQUENCES` so they decode to `(Keys.Escape, Keys.ControlM)`
— the same key tuple `Alt+Enter` produces. The existing Alt+Enter newline
handler (`@kb.add('escape', 'enter')` in `cli.py`) then fires unchanged,
so there is no new keybinding to register and no behavioral change for
terminals that don't emit the distinct sequences.

Files
=====

* `hermes_cli/pt_input_extras.py` — new module hosting the helper. Lives
  outside `cli.py` so it's importable in tests without dragging in the
  full CLI runtime (which depends on `fire`, `rich`, etc.).
* `cli.py` — calls `install_shift_enter_alias()` once at module import.
  Wrapped in try/except so prompt_toolkit version drift can't break CLI
  startup.
* `tests/cli/test_cli_shift_enter_newline.py` — 6 tests:
  - registration of all three byte sequences
  - overwrite of stock prompt_toolkit's broken modifyOtherKeys mapping
  - idempotency
  - parser equivalence: CSI-u Shift+Enter == Alt+Enter
  - parser equivalence: modifyOtherKeys Shift+Enter == Alt+Enter
  - plain Enter remains a single key (submit), distinct from the two-key
    Alt+Enter / Shift+Enter tuple
* `website/docs/user-guide/cli.md` — keybinding table updated; new
  "Shift+Enter compatibility" subsection with a per-terminal status table
  noting macOS Terminal / stock Windows Terminal cannot distinguish the
  keystroke at the protocol level.
* `website/docs/getting-started/quickstart.md`,
  `website/docs/guides/tips.md` — short mention pointing readers at the
  full compatibility note in `cli.md`.

Tested
======

  pytest tests/cli/test_cli_shift_enter_newline.py        # 6 passed

Live-tested by triggering `\x1b[13;2u` against the running Vt100Parser
(see test). Not exercised in a real terminal end-to-end because that
requires a Kitty-protocol-capable host; the test exercises the parser
path that drives the live terminal too.
2026-05-08 16:26:51 -07:00
Teknium
26bac67ef9
fix(entry-points): guard hermes_bootstrap import so partial updates don't brick hermes (#22091)
teknium1 hit ModuleNotFoundError: No module named 'hermes_bootstrap' after
a code update, on both his Windows machine AND his Linux workstation.  The
failure mode is real and affects every user who updates hermes by any path
OTHER than a fully-successful ``hermes update``.

## What happens

hermes_bootstrap.py is a top-level module registered via pyproject.toml's
``py-modules`` list (added by Brooklyn's Windows UTF-8 stdio work).  It
must be registered in the venv's editable-install .pth file before Python
can find it as a bare ``import hermes_bootstrap``.

``hermes update`` handles this correctly: (1) git reset --hard, (2) clear
__pycache__, (3) uv pip install -e . (re-registers the package including
the new py-modules list), (4) restart.

BUT if any step AFTER (1) fails — network blip during pip install, PEP 668
on a system Python, venv locked, uv not in PATH, a crash mid-update — the
user is left with new code that references hermes_bootstrap and a venv
that doesn't know about it.  Every hermes invocation after that crashes
with ModuleNotFoundError, including ``hermes update`` itself.  No recovery
path without manual `uv pip install -e .`.

Also affects users who ``git pull`` the repo directly without running
hermes update — relatively common for developers.

## Fix

Wrap ``import hermes_bootstrap`` in a try/except ModuleNotFoundError
across all 6 entry points (hermes_cli/main, run_agent, gateway/run,
acp_adapter/entry, cli, batch_runner).  On Windows, missing bootstrap
means the UTF-8 stdio setup doesn't run — degraded behavior (Unicode
chars may fail to print) but NOT a crash.  POSIX is unaffected either way
since the bootstrap is a no-op there.

Once hermes is running again, the user can ``hermes update`` to fully
recover.

## Test update

tests/test_hermes_bootstrap.py::test_entry_point_imports_bootstrap
scans for the first top-level import in each entry point and asserts it
is hermes_bootstrap.  Extended the check to accept a Try block whose body
is a lone Import of hermes_bootstrap — that's the recovery-friendly form
we just introduced.

Verified behavior by ``mv hermes_bootstrap.py hermes_bootstrap.py.bak``
and confirming ``python -c "import hermes_cli.main"`` succeeds.  82/82
tests pass (hermes_bootstrap + windows-native + windows-compat).
2026-05-08 14:43:13 -07:00
Teknium
0ba1e12abc fix(windows): browser tool + spurious SIGINT from subprocess spawning
Three related Windows-only fixes that together make the browser toolset
actually usable on Windows. Symptom chain: user invokes browser_navigate
-> tool returns {"success": false, "error": "Daemon process exited
during startup with no error output"} and the CLI exits mid-turn with
the session summary.

Root cause (3 layers):

1. tools/browser_tool.py::_find_agent_browser() resolved
   node_modules/.bin/agent-browser to the extensionless POSIX shell
   shim via Path.exists(). On Windows, CreateProcessW cannot execute
   that script (WinError 193 "not a valid Win32 application"). Fix:
   delegate to shutil.which with path=node_modules/.bin so PATHEXT
   picks up agent-browser.CMD on Windows and the extensionless shim
   stays correct on POSIX.

2. Windows Terminal / Win32 delivers a spurious CTRL_C_EVENT to the
   parent hermes.exe whenever a background thread spawns a .cmd
   subprocess. Python 3.11's default SIGINT handler raises
   KeyboardInterrupt in MainThread, which unwinds prompt_toolkit's
   app.run() -> cli.py::run()'s finally block calls _run_cleanup()
   -> _emergency_cleanup_all_sessions -> spawns a concurrent
   _run_browser_command("close", ...) on the same session the agent
   thread just opened. Two agent-browser processes race on the same
   --session name, the daemon startup loses, and the tool returns
   the "Daemon process exited during startup" error. Fix: install a
   Windows-only SIGINT handler that absorbs the signal silently.
   Real user Ctrl+C still routes through prompt_toolkit's own c-c
   keybinding at the TUI layer, which is how Claude Code handles the
   same quirk (driving cancellation via the TUI key handler, not
   signals).

3. In tools/browser_tool.py, both Popen sites now pass
   creationflags=CREATE_NO_WINDOW | STARTF_USESTDHANDLES with
   close_fds=True on Windows. CREATE_NO_WINDOW suppresses the .cmd
   console flash; STARTF_USESTDHANDLES + close_fds ensures the child
   inherits only our three chosen handles (DEVNULL stdin, temp-file
   stdout/stderr) and no leaked parent console handles that could
   confuse agent-browser's native daemon spawn. Notably we do NOT
   add CREATE_NEW_PROCESS_GROUP - on Python 3.11 Windows the flag
   interacts badly with asyncio's ProactorEventLoop and makes things
   worse.

Verified end-to-end on Windows 10 / Windows Terminal / PowerShell:
browser_navigate to https://example.com returns
{"success": true, "title": "Example Domain"} and the CLI stays alive
for follow-up tool calls and assistant turns.

Refs: earlier Windows quirks commits 1cebb3bad (Ctrl+Enter newline),
26f5af52a (environment hints), aefd1a37f (Playwright Chromium).
2026-05-08 14:27:40 -07:00
Teknium
d1838041e5 feat: Ctrl+Enter inserts newline on Windows Terminal
Windows Terminal intercepts Alt+Enter for its fullscreen shortcut, leaving
Windows users with no Enter-involving way to insert a newline in the Hermes
prompt. Fix it by reclaiming c-j on Windows only:

- _bind_prompt_submit_keys now binds c-j (LF) to submit only on POSIX, where
  thin PTYs (docker exec, some SSH configs) deliver Enter as LF. On Windows
  plain Enter is always c-m, so c-j is free.
- Windows-only prompt binding: c-j inserts a newline. Windows Terminal sends
  Ctrl+Enter as LF, so the user-facing keystroke is Ctrl+Enter — no terminal
  settings changes required.
- Alt+Enter binding unchanged; still works on mac/Linux/WSL.
- Test TestPromptToolkitTerminalCompatibility::test_lf_enter_binds_to_submit_handler
  split into platform-aware assertions for POSIX vs win32.
- Fixed the Ctrl+J claim in hermes_cli/tips.py (was wrong before this commit
  even on POSIX) to point Windows users at Ctrl+Enter.

Tradeoff: on Windows, raw Ctrl+J (without Enter) also inserts a newline,
since WT collapses Ctrl+Enter and Ctrl+J to the same c-j keycode. No
conflicting Hermes binding existed for Ctrl+J, so this is a harmless side
effect.
2026-05-08 14:27:40 -07:00
Teknium
cbce5e93fc codebase: add encoding='utf-8' to all bare open() calls (PLW1514)
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.

Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs).  That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.

After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly.  Works identically on every platform
and every locale, no surprise behavior.

Mechanical sweep via:
  ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix     --exclude 'tests,venv,.venv,node_modules,website,optional-skills,               skills,tinker-atropos,plugins' .

All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8').  Nothing
else changed.  Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).

Scope notes:
  - tests/ excluded: test fixtures can use locale encoding intentionally
    (exercising edge cases).  If we want to tighten tests later that's
    a separate PR.
  - plugins/ excluded: plugin-specific conventions may differ; plugin
    authors own their code.
  - optional-skills/ and skills/ excluded: skill scripts are user-authored
    and we don't want to mass-edit them.
  - website/ and tinker-atropos/ excluded: vendored / generated content.

46 files touched, 89 +/- lines (symmetric replacement).  No behavior
change on POSIX or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII.
2026-05-08 14:27:40 -07:00
Teknium
d94fb47717 hermes_bootstrap: Windows-only UTF-8 stdio shim for all entry points
Codebase-wide fix for Python-on-Windows UTF-8 footguns, complementing
the earlier execute_code sandbox fixes (which remain load-bearing for
when the sandbox explicitly scrubs child env).

Problem: Python on Windows has two long-standing text-encoding pitfalls:

  1. sys.stdout/stderr are bound to the console code page (cp1252 on
     US-locale installs) — print('café') crashes with UnicodeEncodeError.
  2. Subprocess children don't know to use UTF-8 unless PYTHONUTF8 and/or
     PYTHONIOENCODING are set in their env — so any Python we spawn
     (linters, sandbox children, delegation workers) hits the same bug.

Solution: A tiny bootstrap module (hermes_bootstrap.py) imported as the
first statement of every Hermes entry point:

  - hermes_cli/main.py   (hermes / hermes-agent console_script)
  - run_agent.py         (hermes-agent direct)
  - acp_adapter/entry.py (hermes-acp)
  - gateway/run.py       (messaging gateway)
  - batch_runner.py      (parallel batch mode)
  - cli.py               (legacy direct-launch CLI)

On Windows, the bootstrap:
  - os.environ.setdefault('PYTHONUTF8', '1')       (PEP 540 UTF-8 mode)
  - os.environ.setdefault('PYTHONIOENCODING', 'utf-8')
  - sys.stdout/stderr/stdin.reconfigure(encoding='utf-8', errors='replace')

Children inherit the env vars → they run in UTF-8 mode.
Current process's stdio is reconfigured → print('café') works now.

On POSIX (Linux/macOS), the bootstrap is a complete no-op.  We don't
touch LANG, LC_*, or anything else — users who have intentionally
configured a non-UTF-8 locale aren't affected.  POSIX systems are
already UTF-8 by default in 99% of modern setups, so there's nothing
to fix.

setdefault() (not overwrite) means users who explicitly set PYTHONUTF8=0
or PYTHONIOENCODING=cp1252 in their environment are respected.

What this does NOT fix: bare open(path, 'w') calls in the *parent*
process still default to locale encoding because PYTHONUTF8 is only
read at interpreter init.  A ruff PLW1514 sweep (separate follow-up)
will add explicit encoding='utf-8' at those ~219 call sites for
belt-and-suspenders.

Tests (17): 16 passed, 1 skipped on Windows.
  - Windows: env vars set, stdio reconfigured, child inherits UTF-8 mode
  - POSIX: complete no-op (verified on fake POSIX + skipped on real
    POSIX since we don't have a Linux box in this session)
  - Idempotence: multiple calls safe
  - Graceful degradation: non-reconfigurable streams don't crash
  - User opt-out: explicit PYTHONUTF8=0 is respected
  - Load order: every entry point's FIRST top-level import is
    hermes_bootstrap, enforced by an AST-level parametrized test

pyproject.toml: added hermes_bootstrap to py-modules so it ships with
pip installs.
2026-05-08 14:27:40 -07:00
Teknium
e93bfc6c93 feat(windows): close remaining POSIX-only landmines — TUI crash, kanban waitpid, AF_UNIX sandbox, /bin/bash, npm .cmd shims, cwd tracking, detach flags
Second pass on native Windows support, driven by a systematic audit across
five areas: POSIX-only primitives (signal.SIGKILL/SIGHUP/SIGPIPE, os.WNOHANG,
os.setsid), path translation bugs (/c/Users → C:\Users), subprocess patterns
(npm.cmd batch shims, start_new_session no-op on Windows), subsystem health
(cron, gateway daemon, update flow), and module-level import guards.

Every change is platform-gated — POSIX (Linux/macOS) behaviour is preserved
bit-identical. Explicit "do no harm" test: test_posix_path_preserved_on_linux,
test_posix_noop, test_windows_detach_popen_kwargs_is_posix_equivalent_on_posix.

## New module

- hermes_cli/_subprocess_compat.py — shared helpers (resolve_node_command,
  windows_detach_flags, windows_hide_flags, windows_detach_popen_kwargs).
  All no-ops on non-Windows.

## CRITICAL fixes (would crash or silently break on Windows)

- tui_gateway/entry.py: SIGPIPE/SIGHUP referenced at module top level would
  AttributeError on import on Windows, breaking `hermes --tui` entirely (it
  spawns this module as a subprocess).  Guard each signal.signal() call with
  hasattr() and add SIGBREAK as Windows' SIGHUP equivalent.

- hermes_cli/kanban_db.py: os.waitpid(-1, os.WNOHANG) in dispatcher tick was
  unguarded.  os.WNOHANG doesn't exist on Windows.  Gate the whole reap loop
  behind `os.name != "nt"` — Windows has no zombies anyway.

- tools/code_execution_tool.py: AF_UNIX socket for execute_code RPC fails on
  most Windows builds.  Fall back to loopback TCP (AF_INET on 127.0.0.1:0
  ephemeral port) when _IS_WINDOWS.  HERMES_RPC_SOCKET env var now accepts
  either a filesystem path (POSIX) or `tcp://127.0.0.1:<port>` (Windows).
  Generated sandbox client parses both.

- cron/scheduler.py: `argv = ["/bin/bash", str(path)]` hardcoded.  Use
  shutil.which("bash") so Windows (Git Bash via MinGit) works, with a
  readable error when bash is genuinely absent.

- 6 bare npm/npx spawn sites: tools_config.py x2, doctor.py, whatsapp.py
  (npm install + node version probe), browser_tool.py x2.  On Windows npm
  is npm.cmd / npx is npx.cmd (batch shims); subprocess.Popen(["npm", ...])
  fails with WinError 193.  shutil.which(...) returns the absolute .cmd
  path which CreateProcessW accepts because the extension routes through
  cmd.exe /c.  POSIX behaviour unchanged (shutil.which still returns the
  same path subprocess would resolve itself).

## HIGH fixes (silent misbehaviour on Windows)

- tools/environments/local.py get_temp_dir: hardcoded /tmp returned on
  Windows meant `_cwd_file = "/tmp/hermes-cwd-*.txt"`, which bash wrote
  via MSYS2's virtual /tmp but native Python couldn't open.  Result: cwd
  tracking silently broken — `cd` in terminal tool did nothing.  Windows
  branch now returns `%HERMES_HOME%/cache/terminal` with forward slashes
  (works in both bash and Python, guaranteed no spaces).

- tools/environments/local.py _make_run_env PATH injection: `/usr/bin not
  in split(":")` heuristic mangles Windows PATH (";" separator).  Gate
  the injection behind `not _IS_WINDOWS`.

- hermes_cli/gateway.py launch_detached_profile_gateway_restart: outer
  Popen + watcher-script Popen both used start_new_session=True, which
  Windows silently ignores.  Watcher stayed attached to CLI's console,
  died when user closed terminal after `hermes update`, left gateway
  stale.  Now branches through windows_detach_popen_kwargs() helper
  (CREATE_NEW_PROCESS_GROUP | DETACHED_PROCESS | CREATE_NO_WINDOW on
  Windows, start_new_session=True on POSIX — identical to main).

## MEDIUM fixes

- gateway/run.py /restart and /update handlers: hardcoded bash/setsid
  chain crashes on Windows when user triggers /update in-gateway.  Now
  has sys.platform=="win32" branch using sys.executable + a tiny
  Python watcher with proper detach flags.  POSIX path is unchanged.

- cli.py _git_repo_root: Git on Windows sometimes returns /c/Users/...
  style paths that break subprocess.Popen(cwd=...) and Path().resolve().
  Added _normalize_git_bash_path() helper that translates /c/Users,
  /cygdrive/c, /mnt/c variants to native C:\Users form.  POSIX no-op.
  _git_repo_root() now routes every result through it.

- cli.py worktree .worktreeinclude: os.symlink on directories failed
  hard on Windows (requires admin or Developer Mode).  Falls back to
  shutil.copytree with a warning log.

## Tests

- 29 new tests in tests/tools/test_windows_native_support.py covering:
  subprocess_compat helpers, TUI entry signal guards, kanban waitpid
  guard, code_execution TCP fallback source-level invariants, cron bash
  resolution, npm/npx bare-spawn lint per-file, local env Windows temp
  dir, PATH injection gating, git bash path normalization, symlink
  fallback, gateway detached watcher flags.

- One existing test assertion adjusted in test_browser_homebrew_paths:
  it compared captured Popen argv to the BARE `"npx"` literal; after the
  shutil.which() change argv[0] is the absolute path.  New assertion
  checks the shape (two items, second is `agent-browser`) rather than
  the exact first-item string.  Behaviour unchanged; test was too strict.

All 56 tests pass on Linux (30 from previous commits + 26 new).
267 tests from the affected files/dirs (browser, code_exec, local_env,
process_registry, kanban_db, windows_compat) all pass — zero regressions.
tests/hermes_cli/ (3909 pass) and tests/gateway/ (5021 pass) unchanged;
all pre-existing test failures confirmed unrelated via `git stash` re-run.

## What's still deferred (LOW priority)

- Visible cmd-window flashes on short-lived console apps (~14 sites) —
  cosmetic, needs a follow-up pass once we have user reports.
- agent/file_safety.py POSIX-only security deny patterns — separate
  hardening task.
- tools/process_registry.py returning "/tmp" as fallback — theoretical;
  reachable only when all env-var candidates fail.
2026-05-08 14:27:40 -07:00
Teknium
9de893e3b0 feat(windows): close native-Windows install gaps — crash-free startup, UTF-8 stdio, tzdata dep, docs
Native Windows (with Git for Windows installed) can now run the Hermes CLI
and gateway end-to-end without crashing.  install.ps1 already existed and
the Git Bash terminal backend was already wired up — this PR fills the
remaining gaps discovered by auditing every Windows-unsafe primitive
(`signal.SIGKILL`, `os.kill(pid, 0)` probes, bare `fcntl`/`termios`
imports) and by comparing hermes against how Claude Code, OpenCode, Codex,
and Cline handle native Windows.

## What changed

### UTF-8 stdio (new module)
- `hermes_cli/stdio.py` — single `configure_windows_stdio()` entry point.
  Flips the console code page to CP_UTF8 (65001), reconfigures
  `sys.stdout`/`stderr`/`stdin` to UTF-8, sets `PYTHONIOENCODING` + `PYTHONUTF8`
  for subprocesses.  No-op on non-Windows.  Opt out via `HERMES_DISABLE_WINDOWS_UTF8=1`.
- Called early in `cli.py::main`, `hermes_cli/main.py::main`, and
  `gateway/run.py::main` so Unicode banners (box-drawing, geometric
  symbols, non-Latin chat text) don't `UnicodeEncodeError` on cp1252
  consoles.

### Crash sites fixed
- `hermes_cli/main.py:7970` (hermes update → stuck gateway sweep): raw
  `os.kill(pid, _signal.SIGKILL)` → `gateway.status.terminate_pid(pid, force=True)`
  which routes through `taskkill /T /F` on Windows.
- `hermes_cli/profiles.py::_stop_gateway_process`: same fix — also
  converted SIGTERM path to `terminate_pid()` and widened OSError catch
  on the intermediate `os.kill(pid, 0)` probe.
- `hermes_cli/kanban_db.py:2914, 3041`: raw `signal.SIGKILL` →
  `getattr(signal, "SIGKILL", signal.SIGTERM)` fallback (matches the
  pattern already used in `gateway/status.py`).

### OSError widening on `os.kill(pid, 0)` probes
Windows raises `OSError` (WinError 87) for a gone PID instead of
`ProcessLookupError`.  Widened the catch at:
- `gateway/run.py:15101` (`--replace` wait-for-exit loop — without this,
  the loop busy-spins the full 10s every Windows gateway start)
- `hermes_cli/gateway.py:228, 460, 940`
- `hermes_cli/profiles.py:777`
- `tools/process_registry.py::_is_host_pid_alive`
- `tools/browser_tool.py:1170, 1206`

### Dashboard PTY graceful degradation
`hermes_cli/pty_bridge.py` depends on `fcntl`/`termios`/`ptyprocess`,
none of which exist on native Windows.  Previously a Windows dashboard
would crash on `import hermes_cli.web_server` because of a top-level
import.  Now:
- `hermes_cli/web_server.py` wraps the pty_bridge import in
  `try/except ImportError` and sets `_PTY_BRIDGE_AVAILABLE=False`.
- The `/api/pty` WebSocket handler returns a friendly "use WSL2 for
  this tab" message instead of exploding.
- Every other dashboard feature (sessions, jobs, metrics, config
  editor) runs natively on Windows.

### Dependency
- `pyproject.toml`: add `tzdata>=2023.3; sys_platform == 'win32'` so
  Python's `zoneinfo` works on Windows (which has no IANA tzdata
  shipped with the OS).  Credits @sprmn24 (PR #13182).

### Docs
- README.md: removed "Native Windows is not supported"; added
  PowerShell one-liner and Git-for-Windows prerequisite note.
- `website/docs/getting-started/installation.md`: new Windows section
  with capability matrix (everything native except the dashboard
  `/chat` PTY tab, which is WSL2-only).
- `website/docs/user-guide/windows-wsl-quickstart.md`: reframed as
  "WSL2 as an alternative to native" rather than "the only way".
- `website/docs/developer-guide/contributing.md`: updated
  cross-platform guidance with the `signal.SIGKILL` / `OSError`
  rules we enforce now.
- `website/docs/user-guide/features/web-dashboard.md`: acknowledged
  native Windows works for everything except the embedded PTY pane.

## Why this shape

Pulled from a survey of how other agent codebases handle native
Windows (Claude Code, OpenCode, Codex, Cline):

- All four treat Git Bash as the canonical shell on Windows, same as
  hermes already does in `tools/environments/local.py::_find_bash()`.
- None of them force `SetConsoleOutputCP` — but they don't have to,
  Node/Rust write UTF-16 to the Win32 console API.  Python does not get
  that for free, so we flip CP_UTF8 via ctypes.
- None of them ship PowerShell-as-primary-shell (Claude Code exposes
  PS as a secondary tool; scope creep for this PR).
- All of them use `taskkill /T /F` for force-kill on Windows, which
  is exactly what `gateway.status.terminate_pid(force=True)` does.

## Non-goals (deliberate scope limits)

- No PowerShell-as-a-second-shell tool — worth designing separately.
- No terminal routing rewrite (#12317, #15461, #19800 cluster) — that's
  the hardest design call and needs a separate doc.
- No wholesale `open()` → `open(..., encoding="utf-8")` sweep (Tianworld
  cluster) — will do as follow-up if users hit actual breakage; most
  modern code already specifies it.

## Validation

- 28 new tests in `tests/tools/test_windows_native_support.py` — all
  platform-mocked, pass on Linux CI.  Cover:
  - `configure_windows_stdio` idempotency, opt-out, env-preservation
  - `terminate_pid` taskkill routing, failure → OSError, FileNotFoundError fallback
  - `getattr(signal, "SIGKILL", …)` fallback shape
  - `_is_host_pid_alive` OSError widening (Windows-gone-PID behavior)
  - Source-level checks that all entry points call `configure_windows_stdio`
  - pty_bridge import-guard present in `web_server.py`
  - README no longer says "not supported"
- 12 pre-existing tests in `tests/tools/test_windows_compat.py` still pass.
- `tests/hermes_cli/` ran fully (3909 passed, 9 failures — all confirmed
  pre-existing on main by stash-test).
- `tests/gateway/` ran fully (5021 passed, 1 pre-existing failure).
- `tests/tools/test_process_registry.py` + `test_browser_*` pass.
- Manual smoke: `import hermes_cli.stdio; import gateway.run;
  import hermes_cli.web_server` — all clean, `_PTY_BRIDGE_AVAILABLE=True`
  on Linux (as expected).

## Files

- New: `hermes_cli/stdio.py`, `tests/tools/test_windows_native_support.py`
- Modified: `cli.py`, `gateway/run.py`, `hermes_cli/main.py`,
  `hermes_cli/profiles.py`, `hermes_cli/gateway.py`,
  `hermes_cli/kanban_db.py`, `hermes_cli/pty_bridge.py`,
  `hermes_cli/web_server.py`, `tools/browser_tool.py`,
  `tools/process_registry.py`, `pyproject.toml`, `README.md`, and 4
  docs pages.

Credits to everyone whose prior PR work informed these fixes — see
the co-author trailers.  All of the PRs listed in
`~/.hermes/plans/windows-support-prs.md` fixing `os.kill` / `signal.SIGKILL`
/ UTF-8 stdio / tzdata / README patterns found the same issues; this PR
consolidates them.

Co-authored-by: Philip D'Souza <9472774+PhilipAD@users.noreply.github.com>
Co-authored-by: Arecanon <42595053+ArecaNon@users.noreply.github.com>
Co-authored-by: XiaoXiao0221 <263113677+XiaoXiao0221@users.noreply.github.com>
Co-authored-by: Lars Hagen <1360677+lars-hagen@users.noreply.github.com>
Co-authored-by: Luan Dias <65574834+luandiasrj@users.noreply.github.com>
Co-authored-by: Ruzzgar <ruzzgarcn@gmail.com>
Co-authored-by: sprmn24 <oncuevtv@gmail.com>
Co-authored-by: adybag14-cyber <252811164+adybag14-cyber@users.noreply.github.com>
Co-authored-by: Prasanna28Devadiga <54196612+Prasanna28Devadiga@users.noreply.github.com>
2026-05-08 14:27:40 -07:00
Teknium
850413f120 feat(computer-use): cua-driver backend, universal any-model schema
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
2026-05-08 11:07:38 -07:00
Teknium
674fad1483
fix(goals): Ctrl+C during /goal loop auto-pauses the goal (#21888)
Reported: Ctrl+C during an active /goal loop felt like it did nothing —
the agent would interrupt the current turn, then immediately queue another
continuation and keep going until the session ended or the 20-turn budget
ran out.

Root cause: cli.py's _maybe_continue_goal_after_turn() ran in the finally:
block around self.chat(...) unconditionally. Whether the turn completed
normally, got interrupted, or returned an empty string, the judge ran on
whatever was in conversation_history and — because the judge is fail-open
— a "continue" verdict pushed another CONTINUATION_PROMPT onto
_pending_input. Ctrl+C was invisible to the hook.

Fix:
- chat() now captures result['interrupted'] onto self._last_turn_interrupted
  (resets to False at entry so early-returns don't leak prior state).
- _maybe_continue_goal_after_turn() checks the flag first: on interrupt,
  auto-pause via mgr.pause(reason='user-interrupted (Ctrl+C)') and print
  a one-liner pointing the user at /goal resume or /goal clear. No judge
  call, no continuation enqueued.
- Also added an empty-response guard that mirrors gateway/run.py's
  _handle_message logic (empty reply → transient failure → skip judging
  so we don't trip the consecutive-parse-failures backstop unnecessarily).

The goal stays in the DB as paused, so /goal resume recovers it after
the user has sorted out whatever made them cancel. /goal clear still
works as before for a full stop.

Tests: tests/cli/test_cli_goal_interrupt.py covers:
  - interrupted turn pauses + doesn't queue + judge is NOT called
  - paused goal is resumable
  - empty / whitespace / missing assistant reply skips judging
  - healthy turn still enqueues continuation / marks done
  - chat() resets _last_turn_interrupted at entry (anti-leak guard)

All 55 existing goal tests still pass.
2026-05-08 06:53:13 -07:00
kshitij
7338e5d9ba
fix(model-switch): prevent stale Ollama credentials after provider switch (#21703)
When switching from a custom local provider (e.g. ollama-launch) to a
cloud provider, two bugs caused the CLI to misbehave:

1. _explicit_api_key/_explicit_base_url were only updated when the switch
   result had non-empty values (guarded by `if result.api_key:` etc.).
   If the previous provider set these to Ollama values ("ollama",
   "http://127.0.0.1:11434/v1"), those stale values leaked into the next
   turn's _ensure_runtime_credentials() call and were forwarded to the
   new provider's API endpoint, causing authentication/routing failures.

   Fix: unconditionally write result.api_key/base_url into the explicit
   fields after every successful switch. An empty string is the correct
   sentinel — it tells _ensure_runtime_credentials to re-resolve from the
   auth store / config rather than forwarding a stale override.

2. In AIAgent.switch_model(), `self.base_url = base_url or self.base_url`
   kept the old Ollama localhost URL whenever the incoming base_url was an
   empty string. For providers that use a native SDK (not an OpenAI-compat
   endpoint), the caller passes base_url="" and expects the agent to clear
   the field — not silently inherit Ollama's address.

   Fix: only update self.base_url when base_url is truthy.

3. _handle_model_picker_selection() was called from the prompt_toolkit
   Enter key binding without any exception guard. Any unexpected error
   in the model-selection code path propagated through prompt_toolkit's
   key-binding dispatcher and caused the entire TUI to exit — which the
   user sees as "the terminal exits when I switch providers".

   Fix: wrap the call in try/except and close the picker on failure.
2026-05-08 14:28:54 +05:30
Austin Pickett
d87c7b99e2
fix(analytics): prevent silent token loss and add Claude 4.5–4.7 pricing (#21455)
- Add pricing entries for Claude Opus 4.5/4.6/4.7, Sonnet 4.5/4.6, and
  Haiku 4.5 with updated source URLs (platform.claude.com)
- Add _normalize_anthropic_model_name() to handle dot-notation variants
  (e.g. claude-opus-4.7 → claude-opus-4-7) for pricing lookups
- Fix silent token loss: ensure session row exists before UPDATE in both
  run_agent.py and hermes_state.py (INSERT OR IGNORE is idempotent)
- Log token persistence failures at DEBUG level instead of swallowing
  them silently — makes undercounted analytics diagnosable
- Surface reasoning tokens in CLI /usage and TUI usage panel
- Add 'reasoning' and 'cost_status' fields to TUI Usage type
2026-05-07 13:24:31 -07:00
oluwadareab12
edbbc96b55 fix(cli): replace get_event_loop() with get_running_loop() to silence RuntimeWarning in process_loop thread (#19285) 2026-05-07 06:35:54 -07:00
Teknium
6e46f99e7e
fix(tui): surface backend error as visible text when final_response is empty (#21245)
When the provider rejects a request (e.g. invalid model slug like
'--provider nous --model kimi-k2.6' where the valid slug is
'moonshotai/kimi-k2.6'), run_conversation() returns
{failed: True, error: <detail>, final_response: None}. The TUI gateway
and one-shot CLI mode both dropped the error on the floor and emitted
an empty turn, so the user saw a blank response with no indication
that anything went wrong.

Mirror the interactive CLI's existing pattern (cli.py:9832): when
final_response is empty AND (failed|partial) is set AND error is
populated, surface 'Error: <detail>' as the visible text. Leaves
the None-with-no-error path and the '(empty)' sentinel path
untouched — an empty successful turn still renders empty, and
existing sentinel handlers keep owning their lane.

Reported by @counterposition in PR #20873; taking a minimal fix
rather than the broader structured-failure refactor proposed there.
2026-05-07 05:53:19 -07:00
Sofia Yang
f5a232af84 refactor: replace 'cmp' text with 🗜️ emoji in status bar
Address review feedback to use the clamp emoji (��️) instead of
the plain text 'cmp' prefix for the compression count indicator.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-05-07 05:27:45 -07:00
Sofia Yang
103e11926f feat(cli): show context compression count in status bar
Display the number of context compressions in the CLI status bar when
compressions > 0, helping users understand conversation compression
pressure during long sessions.

- Wide layout (>=76 cols): shows 'cmp N' between context percent and duration
- Medium layout (52-75 cols): shows 'cmp N' between percent and duration
- Narrow layout (<52 cols): omitted to save space
- Color-coded: dim for 1-4, warn for 5-9, bad for 10+
- Hidden when zero to keep the bar clean for new sessions

Closes #18564
2026-05-07 05:27:45 -07:00
Teknium
fb1ce793e6
feat(security): enable secret redaction by default (#17691, #20785) (#21193)
Flip the default for HERMES_REDACT_SECRETS from off to on so the redactor
already wired into send_message_tool, logs, and tool output actually runs
on a fresh install.

- agent/redact.py: env-var default "" → "true"
- hermes_cli/config.py: DEFAULT_CONFIG security.redact_secrets True;
  two config-template comments rewritten
- gateway/run.py + cli.py: startup log / banner warning when the user
  has explicitly opted out, so the downgrade is visible in agent.log
  and at CLI banner time
- docs/reference/environment-variables.md: description reconciled
- tests: flipped the default-pin, restructured the force=True
  regression test to explicit-false instead of unset

Users who need raw credential values (redactor development) can still
opt out via security.redact_secrets: false in config.yaml or
HERMES_REDACT_SECRETS=false in .env.

Closes #17691.
Addresses #20785 (short-term output-pipeline recommendation).
2026-05-07 05:10:33 -07:00
brooklyn!
5044e1cbf1
fix(cli): submit LF enter in thin PTYs (#20896) 2026-05-06 13:51:13 -07:00
Cleo
906881c38b fix(cli): catch OSError in _resolve_attachment_path to prevent ENAMETOOLONG dropping long slash commands
When the user pastes a long slash command like \`/goal <long prose>\` into
\`hermes chat\`, the input flows into \`_detect_file_drop()\`, whose
\`starts_like_path\` prefilter accepts anything starting with \`/\` and
forwards it to \`_resolve_attachment_path()\`. That helper calls
\`Path.exists()\` which invokes \`os.stat()\`, which raises
\`OSError(errno=ENAMETOOLONG)\` — 63 on macOS, 36 on Linux — when the
candidate exceeds NAME_MAX (typically 255 bytes).

The OSError propagates up to the broad \`except Exception\` in
\`process_loop\` (cli.py:11798), gets logged at WARNING level, and the
user's input is silently dropped. From the user's POV the chat prompt
hangs — the only signal is in agent.log:

  WARNING cli: process_loop unhandled error (msg may be lost):
    [Errno 63] File name too long: "/goal Drive the space board..."

This affects any slash command with prose-length arguments — \`/goal\`
in particular but also \`/skill\`, \`/cron\`, custom user commands.

Fix: wrap the \`exists()\`/\`is_file()\` calls in try/except OSError so
structurally-invalid path candidates cleanly return None. The slash-
command dispatch path downstream (cli.py:11718) then handles the
input correctly.

Tests: two new regression cases in test_cli_file_drop.py cover the
original \`/goal\` reproducer and a synthetic long path. All 35 file-
drop tests pass.

Reproducer (without the fix):
  python -c "from cli import _detect_file_drop;
             _detect_file_drop('/goal ' + 'a'*300)"
  → OSError: [Errno 63] File name too long
2026-05-06 06:34:48 -07:00
Teknium
a0fedfbb1b
feat(checkpoints): v2 single-store rewrite with real pruning + disk guardrails (#20709)
Replaces the per-directory shadow-repo design with a single shared shadow
git store at ~/.hermes/checkpoints/store/. Object DB is now deduplicated
across every working directory the agent has ever touched; a dozen
worktrees of the same project cost near-zero in additional disk.

Why
---
Pre-v2 design had three compounding problems that let ~/.hermes/checkpoints/
grow to multi-GB on active machines:

1. Each working directory got its own full shadow git repo — no object
   dedup across projects or across worktrees of the same project.
2. _prune() was a documented no-op: max_snapshots only limited the
   /rollback listing. Loose objects accumulated forever.
3. Defaults: enabled=True, auto_prune=False — users paid the disk cost
   without ever asking for /rollback.

Field report on a single workstation: 847 MB across 47 shadow repos,
mostly redundant clones of the hermes-agent source tree.

Changes
-------
- tools/checkpoint_manager.py: full rewrite. Single bare store, per-project
  refs (refs/hermes/<hash>), per-project indexes (store/indexes/<hash>),
  per-project metadata (store/projects/<hash>.json with workdir +
  created_at + last_touch). On first v2 init, any pre-v2 per-directory
  shadow repos are auto-migrated into legacy-<timestamp>/ so the new
  store starts clean. _prune() now actually rewrites the per-project ref
  to the last max_snapshots commits and runs git gc --prune=now. New
  _enforce_size_cap() drops oldest commits round-robin across projects
  when the store exceeds max_total_size_mb. _drop_oversize_from_index()
  filters any single file larger than max_file_size_mb out of the snapshot.
- hermes_cli/checkpoints.py: new 'hermes checkpoints' CLI
  (status / list / prune / clear / clear-legacy) for managing the store
  outside a session.
- hermes_cli/config.py: flipped defaults — enabled=False, max_snapshots=20,
  auto_prune=True. Added max_total_size_mb=500, max_file_size_mb=10.
  Tightened DEFAULT_EXCLUDES (added target/, *.so/*.dylib/*.dll,
  *.mp4/*.mov, *.zip/*.tar.gz, .worktrees/, .mypy_cache/, etc.).
- run_agent.py / cli.py / gateway/run.py: thread the new kwargs through
  AIAgent and the startup auto_prune hooks.
- Tests rewritten to match v2 storage while keeping backwards-compat
  coverage for the pre-v2 prune path (per-directory shadow repos under
  base/ are still swept correctly for anyone mid-migration).
- Docs updated: user-guide/checkpoints-and-rollback.md explains the
  shared store, new defaults, migration, and the new CLI;
  reference/cli-commands.md documents 'hermes checkpoints'.

E2E validated
-------------
- Legacy migration: pre-v2 shadow repos auto-archived into legacy-<ts>/.
- Object dedup: two projects with an identical shared.py blob resolve to
  7 total objects in the store (v1 would have stored the blob twice).
- max_snapshots=3 actually enforced: after 6 commits, list shows 3.
- Orphan prune: deleting a project's workdir + 'hermes checkpoints prune
  --retention-days 0' removes its ref, index, and metadata; GC reclaims
  the objects.
- max_file_size_mb=1 excludes a 2 MB weights.bin while keeping the
  tracked source code files.
- hermes checkpoints {status,prune,clear,clear-legacy} all work from the
  CLI without an agent running.

Breaking / migration
--------------------
No in-place data migration — legacy per-directory shadow repos are moved
into legacy-<timestamp>/ on first run. Old /rollback history is still
accessible by inspecting the archive with git; run
'hermes checkpoints clear-legacy' to reclaim the space when ready. Users
relying on /rollback must now set checkpoints.enabled=true (or pass
--checkpoints) explicitly.
2026-05-06 05:44:35 -07:00
helix4u
76074d9ee6 fix(cli): recover classic CLI output after resize 2026-05-06 04:20:54 -07:00
adybag14-cyber
e45df2e81e fix(ui): reduce status-line jitter while scrolling 2026-05-06 04:02:09 -07:00
Teknium
e70e49016f
fix(cli): guard logger.debug in signal handler (#13710 regression) (#20673)
CPython's logging module is not reentrant-safe.  `Logger.isEnabledFor`
caches level results in `Logger._cache`; under shutdown races the cache
can be cleared (`Logger._clear_cache`, triggered by logging config changes
from another thread) or mid-mutation when a signal fires, raising
`KeyError: <level_int>` (e.g. `KeyError: 10` for DEBUG) inside the signal
handler.

When that happens, the KeyError escapes before the `raise KeyboardInterrupt()`
on the next line can fire, which bypasses prompt_toolkit's normal interrupt
unwind and surfaces as the EIO cascade originally reported in #13710.

Issue #13710 shipped two defenses (asyncio exception handler + outer
`except (KeyError, OSError)` with EIO suppression) that cover the EIO
unwind path.  This patch closes the remaining escape hatch: the
`logger.debug` call at the top of `_signal_handler` itself.  Wrap it in a
bare `try/except Exception: pass` so logging can never raise through a
signal handler.

Observed in the wild: debug report on 0.12.0 (commit 8163d371) shows the
exact stack — KeyError: 10 at logging/__init__.py:1742 inside the
signal handler's `logger.debug`, followed by the EIO cascade from
prompt_toolkit's emergency flush.

Tests: adds `TestSignalHandlerLoggingRace` to
`tests/hermes_cli/test_suppress_eio_on_interrupt.py` with 6 new cases:
- normal path still raises KeyboardInterrupt
- KeyError(10) from logger.debug does not escape
- any Exception from logger.debug is swallowed
- agent.interrupt still fires when logger.debug raises
- agent.interrupt raising also does not escape
- BaseException (SystemExit) is NOT swallowed — guard uses `except Exception`
  deliberately so real shutdown signals still propagate

Closes #13710 regression.
2026-05-06 03:55:47 -07:00
kshitijk4poor
395dbcc873 feat(browser): add Lightpanda engine support with automatic Chrome fallback
Add Lightpanda as an optional browser engine for local mode.
Lightpanda is a headless browser built from scratch in Zig -- faster
navigation than Chrome with significantly less memory.

One config line to enable:
  browser:
    engine: lightpanda

New functions in browser_tool.py:
- _get_browser_engine() -- config/env reader with validation + caching
- _should_inject_engine() -- only inject in local non-cloud mode
- _needs_lightpanda_fallback() -- detect empty/failed LP results
- _chrome_fallback_screenshot() -- temporary Chrome session for screenshots
- Engine injection in _run_browser_command (--engine flag)
- browser_vision pre-routes screenshots to Chrome when engine=lightpanda

Config:
- browser.engine in DEFAULT_CONFIG (auto/lightpanda/chrome)
- AGENT_BROWSER_ENGINE in OPTIONAL_ENV_VARS
- /browser status shows engine info in local mode

Rebased from PR #7144 onto current main. All existing code preserved --
pure additions only (+520/-2).

25 new tests + 81 total browser tests pass (0 failures).
2026-05-06 03:23:19 -07:00