#19194's fix added process.exit(0) to die()/dieWithCode() with a comment
relying on a process.on('exit') handler in entry.tsx that resets terminal
modes — but that handler was never installed. So /quit, Ctrl+C, Ctrl+D and
every process.exit() path left DEC mouse tracking (?1000/1002/1003/1006)
armed in the parent shell. The terminal then kept emitting mouse reports
into stdin — read as keystrokes by the shell or a freshly relaunched TUI —
surfacing as ...;...M garbage in the input box.
Install the missing handler. 'exit' fires once on real termination and runs
synchronous code only; resetTerminalModes() writes via writeSync, so the
disable sequence lands before the process is gone.
Fixes #28419
#41867 replaced mkNpmPassthru's patchPhase with
`cp $npmDeps/package-lock.json package-lock.json`, on the theory that
prefetch-npm-deps strips advisory fields (engines/os/cpu) from the cache
lockfile. That diagnosis was wrong.
prefetch-npm-deps copies the lockfile into the cache *verbatim*
(prefetch-npm-deps/src/main.rs reads it and writes it unchanged). Building the
cache fresh from the current root lockfile yields exactly the pinned
npmDepsHash, and that cache's package-lock.json is byte-identical to the source
(740 "engines" blocks on each side). With the hash correct, npmConfigHook's
consistency check passes on its own — verified by building .#tui and .#default
green with this (original) patchPhase.
So the cp was unnecessary, and worse: it bypasses the consistency check
wholesale, silently masking a genuinely stale npmDepsHash (a lockfile that
changed without its hash being refreshed) instead of failing loudly. The
original patchPhase keeps the check meaningful while still handling the one real
cosmetic difference it was written for (trailing newlines); stale-hash drift is
caught by the npmDepsHash itself plus the auto-fix workflow.
Keeps the fix-lockfiles real-build verification and the nix-lockfile-fix.yml
file-path fix from #41867 — only the patchPhase cp is reverted.
addToSystemPackages exports HERMES_HOME system-wide and puts the hermes CLI on
interactive users' PATH, so those users (in the hermes group) share the
gateway's state — that's the option's whole purpose. But the activation script
wrote config.yaml as 0640 (group read-only), so an interactive user saving a
setting via the CLI/TUI hit:
error: [Errno 13] Permission denied: '/var/lib/hermes/.hermes/config.yaml'
Make the mode conditional: 0660 when addToSystemPackages is set (group hermes
can write), else the previous 0640. .env stays 0640 either way — it holds
secrets, not user-facing settings. The config merge already preserves
user-added keys across rebuilds, so this simply lets interactive hermes-group
users actually make those edits.
Verified by evaluating the module's activation script for both option values:
addToSystemPackages=true -> chmod 0660, false -> chmod 0640.
Fold the slash-worker subprocess close into _finalize_session itself —
the single _finalized-guarded session-end chokepoint — instead of
relying on each caller (_teardown_session, _shutdown_sessions) to close
it separately. A future code path that finalizes a session directly can
no longer reintroduce the #38095 worker leak.
Idempotent: _SlashWorker.close() is poll()-guarded and _finalize_session
short-circuits on _finalized, so the existing teardown paths are
unaffected. Drops the now-redundant separate close() in
_shutdown_sessions.
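A minimal sketch of why the double guard is idempotent, assuming a worker whose close() checks poll() first and a session whose finalizer short-circuits on a flag. FakeWorker and Session are hypothetical stand-ins, not the real _SlashWorker or session classes:

```python
class FakeWorker:
    """Hypothetical stand-in for the slash-worker subprocess wrapper."""

    def __init__(self):
        self.alive = True
        self.close_calls = 0

    def poll(self):
        # Same convention as subprocess.Popen.poll(): None while running.
        return None if self.alive else 0

    def close(self):
        if self.poll() is not None:
            return  # already exited, nothing to tear down
        self.close_calls += 1
        self.alive = False


class Session:
    """Hypothetical session with a single guarded finalize chokepoint."""

    def __init__(self, worker):
        self.worker = worker
        self.finalized = False

    def finalize(self):
        if self.finalized:
            return False  # second and later calls are no-ops
        self.finalized = True
        self.worker.close()  # the close now lives inside the chokepoint
        return True
```

Any caller that reaches finalize() gets the worker closed exactly once, and a leftover explicit close() elsewhere is harmless.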
Note: the active leak this issue reported was already fixed on main
(WS-orphan reaper #38591, _restart_slash_worker close, atexit shutdown).
This addresses the residual defense-in-depth gap the reporter correctly
identified in their follow-up comment.
close() is the hard teardown for true session boundaries (/new, /reset,
session expiry). It already closes the OpenAI client and child agents but
left the conversation-history list intact. Mirror the soft-eviction path
(_release_evicted_agent_soft clears _session_messages) so a held reference
to a closed agent — e.g. a draining background task — doesn't pin tens of
MB of tool outputs until the agent object itself is collected.
Daemon thread polls _is_orphaned (original ppid check + psutil create_time PID-reuse
guard, no PR_SET_PDEATHSIG). On orphan, drains an in-flight command up to a grace
window then os._exit(0). Started before the HermesCLI build to cover the spawn window.
Task: swl-qrf.8
_evict_cached_agent (the chokepoint for /new, /model, /undo, session
resets — 17 call sites) only popped the cache entry, dropping the
AIAgent reference without releasing its httpx client pool. AIAgent
holds reference cycles (callbacks, tool state) so CPython refcounting
does not free the client promptly; under steady gateway traffic the
held sockets + buffers accumulate and RSS climbs (the leak class behind
Now the chokepoint pops AND schedules a soft release_clients() on a
daemon thread (mirrors the cap-enforcer / idle-sweeper). Soft release
frees the client pool + per-turn child subagents but preserves the
session's terminal sandbox / browser / bg processes for resumption.
Mid-turn agents are skipped so a running request is never torn down.
Also fixes the no-lock branch which previously never popped at all.
Long-lived gateways under heavy cron/build workloads grow steadily (~18 MB/hr
post-phantom-dispatch-fix) and eventually need a restart-or-OOM. Four retention
sites, all confirmed live on current main:
1. _evict_cached_agent() (/model, /reasoning, codex-runtime, /undo, etc.) popped
the cache entry without releasing the agent's OpenAI client, httpx transport,
SSL context, or conversation history. Only /new cleaned up first. Now releases
clients on a daemon thread, matching _enforce_agent_cache_cap.
2. _release_evicted_agent_soft() now clears _session_messages after
release_clients() — tool outputs (file reads, terminal output, search results)
can be tens of MB per 100+-tool-call session; the list is rebuilt from
persisted session JSON on resume, so dropping it on soft eviction is safe.
3. The session-expiry watcher (permanent finalization) now drops the session's
per-session control dicts (_session_model_overrides, _session_reasoning_overrides,
_pending_approvals, _update_prompt_pending, _pending_model_notes). These leaked
one entry per session per gateway lifetime. NOTE: this is the session-finalize
path, NOT idle agent-cache eviction — an idle-evicted session is still alive and
rebuilds its agent from these overrides, so pruning them there would silently
reset a user's /model choice.
4. _tool_defs_cache is now bounded (_TOOL_DEFS_CACHE_MAX=8) with oldest-first
eviction instead of growing unboundedly across the distinct toolset/config
fingerprints a gateway sees over its lifetime.
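The bounded cache in item 4 could look roughly like the sketch below. It assumes insertion-ordered (oldest-first) eviction; BoundedCache is a hypothetical name and the real _tool_defs_cache may be structured differently:

```python
from collections import OrderedDict

TOOL_DEFS_CACHE_MAX = 8  # named constant instead of a bare 8


class BoundedCache:
    """Hypothetical oldest-first bounded cache keyed by a fingerprint."""

    def __init__(self, max_entries=TOOL_DEFS_CACHE_MAX):
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        # Re-assigning an existing key keeps its original insertion slot.
        self._data[key] = value
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self):
        return len(self._data)
```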
Salvaged from #25318 by Michael Steuer (@mssteuer); fix 3 redirected from the
idle-sweep to the session-finalize lifecycle, magic number 8 lifted to a named
constant, test ported.
Fixes #19251
Co-authored-by: Michael Steuer <michael@make.software>
test_concurrent_compressions_same_session_serialize relied on a
time.sleep(0.25) inside the stubbed compressor to make the two threads
overlap inside the per-session lock window. Under CI CPU starvation that
sleep is insufficient: one thread can acquire -> compress -> rotate ->
RELEASE the lock before the other reaches try_acquire, so both acquire on
the shared session_id and both compress (the recurring 'Expected exactly
one agent to compress, got 2' failure on shard test (1)).
Replace the timing dependency with a threading.Barrier(2) wrapped around
the shared db's try_acquire_compression_lock: both threads rendezvous
immediately before the real (atomic) acquire, guaranteeing genuine
simultaneous contention regardless of scheduling. The real lock logic is
unchanged and still picks exactly one winner — this only fixes the test's
overlap guarantee. Restored after join so the post-join lock-leak
assertion hits the unwrapped method.
Verified: 20/20 plain + 15/15 under all-core CPU stress (load avg ~4.6),
where the old version flaked.
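The barrier wrapping can be sketched as follows. FakeDB is a hypothetical stand-in for the shared db (it never releases the lock, unlike the real compressor path), and run_contended is an illustrative helper rather than the actual test body:

```python
import threading


class FakeDB:
    """Hypothetical db whose try-acquire is atomic and picks one winner."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = set()

    def try_acquire_compression_lock(self, session_id):
        with self._mutex:
            if session_id in self._held:
                return False
            self._held.add(session_id)
            return True


def run_contended(db, session_id="s1"):
    barrier = threading.Barrier(2)
    real_acquire = db.try_acquire_compression_lock

    def gated_acquire(sid):
        barrier.wait(timeout=5)  # both threads rendezvous here first
        return real_acquire(sid)  # then hit the unmodified lock logic

    db.try_acquire_compression_lock = gated_acquire
    results, results_lock = [], threading.Lock()

    def worker():
        got = db.try_acquire_compression_lock(session_id)
        with results_lock:
            results.append(got)

    threads = [threading.Thread(target=worker) for _ in range(2)]
    try:
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    finally:
        # Restore so later assertions exercise the unwrapped method.
        db.try_acquire_compression_lock = real_acquire
    return results
```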
#41076 makes `hermes plugins list` discover nested category plugins (e.g.
observability/nemo_relay). This adds the missing enable/disable mutation path
so those plugins can actually be toggled, and fixes two incomplete-update
breakages on the #41076 base.
Before: `hermes plugins enable nemo_relay` -> "Plugin 'nemo_relay' is not
installed or bundled." (exit 1), because cmd_enable/cmd_disable went through
_plugin_exists(), which only checked top-level plugins/<name>/.
Changes:
- Add _resolve_plugin_key(): resolve a bare manifest/leaf name OR a full
path-derived key (observability/nemo_relay) to the canonical key the runtime
loader gates on, reusing #41076's _discover_all_plugins(). A bare leaf name
ambiguous across two categories resolves to None rather than silently picking
one.
- cmd_enable/cmd_disable resolve first, persist the canonical key, and drop any
stale legacy bare-name alias so the enabled/disabled lists can't drift into a
contradictory state. _plugin_exists delegates to the same resolver.
- Fix #41076 base breakages: _discover_all_plugins now returns 6-tuples, but
web_server._merged_plugins_hub() still unpacked 5 (ValueError on the
dashboard plugins-hub endpoint) and several test_plugins_cmd_list.py fixtures
were still 5-tuples. Both updated; the hub status check is now key-aware.
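The resolution rule can be illustrated with a small sketch. It assumes the discovered plugins are available as a flat list of canonical keys; resolve_plugin_key here is a simplified hypothetical, not the real _resolve_plugin_key signature:

```python
def resolve_plugin_key(name, discovered_keys):
    """Map a bare leaf name or a full key to one canonical key, else None."""
    if name in discovered_keys:
        return name  # already a canonical key (top-level or category/leaf)
    matches = [k for k in discovered_keys if k.rsplit("/", 1)[-1] == name]
    # Zero matches means unknown; two or more means ambiguous. Both give None
    # rather than silently picking one.
    return matches[0] if len(matches) == 1 else None
```

Returning None for the ambiguous case lets the caller print an error and ask for the full category/leaf key.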
Verified e2e on the real CLI + runtime loader (isolated HERMES_HOME):
`hermes plugins enable nemo_relay` writes observability/nemo_relay to
config.yaml and the loader then loads it (enabled=True, error=None); a stale
bare-name alias is cleared on disable; the dashboard _merged_plugins_hub() runs
without crashing. Adds resolution + enable/disable tests; full
tests/hermes_cli/test_plugins_cmd* + web_server plugin tests green.
Follow-up to #41076 (#41066). Branched from that PR's head.
Salvaged from #6600 (@kristianvast) — re-scoped to the voice half only and
rebased onto current main. The cascading-interrupt hang half of the original
PR landed independently in dd0d1222a, so this carries ONLY Problem 1.
When a voice/audio message arrives while the agent is busy on the same
session, it hit the interrupt path with empty text because STT only ran after
the running-agent guard — the voice was effectively lost. Now we transcribe
audio BEFORE signaling the agent (and on the fresh-message path), echo the raw
transcript back to the user (🎙️), and _enrich_message_with_transcription
returns (text, transcripts) so callers can echo. A new
_dequeue_pending_with_transcription drives the post-agent drain the same way.
Reapplied onto _prepare_inbound_message_text (inbound enrichment was extracted
from the inline dispatch block since the original PR).
Co-authored-by: Kristian Vastveit <kristian@agrointel.no>
On-disk vitest coverage for the auto-heapdump disk-safety guard: opt-in
gating (suppressed diagnostics-only path), truthy-spelling acceptance,
manual-trigger passthrough, and the retention prune. Test approach
adapted from #21780 (briandevans) and #21822 (LeonSGP43), reconciled to
the merged gate semantics. Maps alarcritty into AUTHOR_MAP for CI.
Automatic heap dumps from the TUI memory monitor could write multi-GiB
.heapsnapshot files on every threshold cross, growing ~/.hermes/heapdumps
to tens of GiB. Add four layered safeguards:
- Gate auto-high/auto-critical snapshots behind HERMES_AUTO_HEAPDUMP=1;
manual dumps remain unchanged.
- Always write the lightweight diagnostics JSON sidecar so users still
get an actionable artifact when the snapshot is suppressed.
- Cap total bytes in the dump dir (HERMES_HEAPDUMP_MAX_BYTES, default
2 GiB), evicting oldest first, retaining the newest.
- Add a cooldown between auto dumps (HERMES_AUTO_HEAPDUMP_COOLDOWN_MS,
default 10 min) so an oscillating heap can't re-trigger.
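The byte cap can be pictured with a language-neutral sketch, written in Python here for brevity. select_evictions is a hypothetical pure function over (name, mtime, size) tuples; the actual prune operates on files in the dump dir:

```python
def select_evictions(files, max_bytes):
    """Return names to delete, oldest first, until the total fits the cap.

    files: iterable of (name, mtime, size_in_bytes).
    The newest entry is always retained, even if it alone exceeds the cap.
    """
    ordered = sorted(files, key=lambda f: f[1])  # oldest mtime first
    total = sum(size for _, _, size in ordered)
    evict = []
    for name, _, size in ordered[:-1]:  # never consider the newest
        if total <= max_bytes:
            break
        evict.append(name)
        total -= size
    return evict
```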
Closes #21767
When agent.interrupt() fires during an active LLM call, the main poll loop
force-closes the worker-local httpx client to stop token generation. That
raises a transport error (RemoteProtocolError) on the worker thread — the
EXPECTED consequence of our own close, not a network bug.
The streaming retry loop misclassified it as a transient connection error
and retried; each doomed retry stalled for the full stream-stale timeout
(up to 300s). Because the gateway caches AIAgent instances per session, the
stale worker outlived the interrupted turn and raced the next turn's request
on shared client state — the root of the multi-minute cascading-interrupt
hang reported in the wild.
Fix: a request-local _request_cancelled token set by the poll loop right
before the force-close, in both interruptible_api_call (non-streaming) and
interruptible_streaming_api_call. The worker's exception handler checks the
token and exits cleanly — no retry, no fallback, no 'reconnecting' status —
instead of treating the forced error as transient. The token is request-
local (not agent._interrupt_requested, which is cleared at turn boundaries)
so a stale worker outliving its turn still recognizes its own forced close.
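A minimal sketch of the request-local token, assuming a threading.Event created per request and a worker loop that checks it before classifying an error as transient. TransportClosed and run_request are hypothetical names, not the helpers in agent/chat_completion_helpers.py:

```python
import threading


class TransportClosed(Exception):
    """Hypothetical stand-in for the error raised by a force-closed client."""


def run_request(send, cancelled, max_retries=3):
    """Call send(); retry transport errors unless this request was cancelled."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return ("ok", send(), attempts)
        except TransportClosed:
            if cancelled.is_set():
                # Our own force-close caused this: exit without retrying.
                return ("cancelled", None, attempts)
            if attempts > max_retries:
                raise
```

Because the Event belongs to the request rather than the agent, clearing agent-level interrupt state at a turn boundary cannot hide the cancellation from a worker that outlives its turn.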
Original diagnosis and fix by @kristianvast (PR #6600), against the then-
inline methods in run_agent.py. Those were since extracted into
agent/chat_completion_helpers.py, so the fix is reapplied there.
Co-authored-by: Kristian Vastveit <kristianvast@users.noreply.github.com>
A misconfigured/slow external memory provider could hold the agent in
the 'running' state for minutes after the final response was delivered.
MemoryManager.sync_all / queue_prefetch_all looped provider.sync_turn /
queue_prefetch INLINE on the turn-completion path; a provider making a
blocking network/daemon call (a broken Hindsight daemon was observed
blocking ~298s before failing) blocked run_conversation from returning.
Because every interface (CLI, TUI, gateway) marks the agent 'running'
until run_conversation returns, the agent stayed busy for the full block
and any follow-up message triggered an aggressive interrupt that dropped
the message.
Dispatch provider sync/prefetch to a lazily-created single-worker
background executor. sync_all / queue_prefetch_all return immediately;
work completes (or fails, logged) in the background. A single worker
serializes writes so turn N lands before turn N+1. flush_pending()
provides a barrier for session boundaries and deterministic tests.
shutdown_all() drains the executor with a bounded timeout so a wedged
provider can never hang teardown.
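The dispatch pattern can be sketched with the standard library as below. BackgroundSync is a hypothetical class (not MemoryManager itself), and this version is not hardened for concurrent submit() calls:

```python
from concurrent.futures import ThreadPoolExecutor, wait


class BackgroundSync:
    """Hypothetical lazily-created single-worker dispatcher."""

    def __init__(self):
        self._executor = None  # no thread until the first submit
        self._pending = []

    def submit(self, fn, *args):
        if self._executor is None:
            # One worker means submitted jobs run strictly in order.
            self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending.append(self._executor.submit(fn, *args))

    def flush_pending(self, timeout=None):
        """Barrier: wait for queued work; True if everything finished."""
        _, not_done = wait(self._pending, timeout=timeout)
        self._pending = list(not_done)
        return not not_done

    def shutdown(self, timeout=5.0):
        if self._executor is None:
            return
        self.flush_pending(timeout=timeout)  # bounded drain
        self._executor.shutdown(wait=False)  # never block on a wedged job
        self._executor = None
```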
Builtin-only / no-provider sessions spawn no executor (zero new threads
in the common case).
Review feedback (#40998): `rm -rf` / `Remove-Item -Recurse -Force` on the
install dir is destructive -- a user might still want whatever is there.
Rename the broken checkout to a timestamped `<dir>.broken-<ts>` backup and
re-clone fresh, so nothing is ever deleted. Transient cleanup of a clone
attempt that fails within the same run is left as-is.
Behavioral coverage for install.sh's clone_repo() guard (removes a
commit-less checkout, keeps a real one, ignores a non-repo dir) plus a
contract check that install.ps1's repo-validity gate requires a resolvable
HEAD.
An interrupted previous clone leaves the install dir's .git present but with
no initial commit. rev-parse --is-inside-work-tree and git status both still
succeed there, so the installer entered the update path and ran `git stash`,
which aborted with "You do not have the initial commit yet" and failed the
desktop install at the "Cloning Hermes repository" stage.
- install.ps1: add a `git rev-parse --verify HEAD` probe to the repo-validity
check so a commit-less checkout is treated as broken and re-cloned fresh.
- install.sh: mirror it at the top of clone_repo() — drop a partial checkout
with no resolvable HEAD so the fresh-clone path handles it (POSIX parity).
Fixes #40998
Lift the `_handle_*_command` cluster (2,077 LOC) out of HermesCLI into
hermes_cli/cli_commands_mixin.py; HermesCLI now inherits CLICommandsMixin so
every self.<handler> call resolves unchanged via the MRO. Behavior-neutral.
Import discipline mirrors gateway/slash_commands.py (PR #41886): neutral deps
imported at the mixin module top level; cli.py-internal helpers/constants
(_cprint, _ACCENT, save_config_value, ...) imported lazily inside each handler
via 'from cli import ...' so the mixin never imports cli at module scope.
cli.py 16215 -> 14139 LOC. One test mock repointed (cli.is_browser_debug_ready
-> hermes_cli.cli_commands_mixin.is_browser_debug_ready).
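The mixin-plus-lazy-import shape looks roughly like this sketch. CommandsMixin and HostCLI are hypothetical, and the stdlib json module stands in for the host-module helpers that the real handlers import lazily:

```python
class CommandsMixin:
    """Hypothetical mixin holding handlers lifted out of the host class."""

    def _handle_echo_command(self, text):
        # Lazy import: resolved at call time, so this module never imports
        # its host at module scope. `json` is only a stand-in here.
        import json

        return json.dumps({"prefix": self.prefix, "text": text})


class HostCLI(CommandsMixin):
    """Hypothetical host; its call sites are unchanged by the move."""

    prefix = ">"

    def dispatch(self, text):
        # Resolved through the MRO to CommandsMixin._handle_echo_command.
        return self._handle_echo_command(text)
```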
* fix(cli): set PYTHON env for node-gyp native builds on NixOS
node-gyp (triggered by node-pty during npm ci) looks for python3 on
PATH, which fails on NixOS because python3 lives in the nix store and
is not on the system PATH.
Add _nixos_build_env() — a two-tier helper that detects NixOS and:
1. Fast path: hermes venv python3 (~0s)
2. Fallback: nix-shell which python3 (~2-5s)
Wire it into _run_npm_install_deterministic() via a new env= parameter,
then pass it through cmd_gui() and _update_node_dependencies().
Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged.
* fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None
- Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return
{**os.environ, "PYTHON": ...} instead of {"PYTHON": ...} — subprocess.run
with env= replaces the entire environment, so the old code wiped PATH
and broke npm/node on NixOS entirely.
- Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead
of unanchored substring match (could match ID_LIKE=...nixos).
- Removes redundant Path.exists() guard before read_text(); just catches
OSError (one filesystem read instead of two).
- Adds explicit return None at end of function for type-hint consistency.
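The two fixes above can be illustrated with a short sketch. is_nixos and build_env are hypothetical helper names, and the use of re.MULTILINE is an assumption about how the anchored pattern is applied to a multi-line os-release file:

```python
import os
import re


def is_nixos(os_release_text):
    # Anchored per line, so "ID_LIKE=... nixos" does not match.
    return re.search(r"^ID=nixos$", os_release_text, re.MULTILINE) is not None


def build_env(python_path, base_env=None):
    # subprocess.run(env=...) replaces the whole environment, so merge
    # with the existing one instead of passing {"PYTHON": ...} alone.
    base = os.environ if base_env is None else base_env
    return {**base, "PYTHON": python_path}
```

Passing the merged mapping as env= keeps PATH and the rest of the caller's environment visible to npm and node.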
test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler
clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py
in this PR's extraction; repoint the guard so it still fires on clamp removal.
tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.
Tests for the extracted handlers mocked symbols at gateway.run.*; the handlers
now resolve top-level-imported deps (atomic_json_write, fetch_account_usage,
render_account_usage_lines) and __file__ from gateway.slash_commands. Repoint
those mocks. run.py-resident methods (_increment_restart_failure_counts,
_clear_restart_failure_count) keep their gateway.run.atomic_json_write mock —
only the moved handlers' mocks change.
tests/gateway/ 6415 passed / 0 failed.
The in-session slash commands (/model, /reset, /usage, /compress, /voice, ...)
— 42 _handle_*_command handlers, ~3,200 LOC — move out of gateway/run.py into a
mixin GatewayRunner inherits. self._handle_*_command dispatch + all test
references resolve unchanged via the MRO.
Neutral deps (MessageEvent, EphemeralReply, Platform, t, cfg_get, atomic_*_write,
account-usage helpers, stdlib) imported at the mixin top level. The ~10 run.py-
internal helpers (_hermes_home, _load_gateway_config, _resolve_gateway_model,
_AGENT_PENDING_SENTINEL, ...) imported lazily inside the handlers that need them
to avoid an import cycle.
gateway/run.py 19157 -> 15870 LOC; GatewayRunner direct methods 214 -> 172.
Behavior-neutral: voice/update/model/compress command test suites pass; all 42
resolve to the mixin via MRO.
A one-off transient transport failure (streaming-close / incomplete
chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to
provider/model fallback (or, for context compression, dropped the summary
and entered cooldown), even when an immediate retry on the same provider
would have succeeded.
Add a single same-target retry at the top of call_llm() and
async_call_llm() — before the existing except-chain — gated on a new
_is_transient_transport_error() that reuses the canonical
_is_connection_error() detector plus a 5xx/408 status check. A second
failure (or any non-transient error: auth, other 4xx, malformed payload)
falls through to first_err and the existing fallback handling unchanged.
This lives in call_llm so every auxiliary task (compression, memory flush,
title generation, session search, vision) shares one transient-retry
surface, rather than each caller re-implementing it. The context
compressor needs no change — it calls call_llm and inherits the retry; its
existing fallback-to-main path (#18458) now composes naturally (retry the
aux model once, then fall back to main only if the retry also fails).
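A minimal sketch of the single same-target retry, assuming the transient check is passed in as a predicate. call_with_single_retry and is_transient_status are hypothetical simplifications of the logic described above, not the real call_llm code:

```python
def is_transient_status(status):
    """5xx or 408 count as transient; other 4xx do not."""
    return status == 408 or 500 <= status <= 599


def call_with_single_retry(fn, is_transient):
    """Run fn(); on a transient error retry once on the same target.

    A non-transient first error, or any error from the retry, propagates
    to the caller, where the existing fallback handling would take over.
    """
    try:
        return fn()
    except Exception as first_err:
        if not is_transient(first_err):
            raise
    return fn()
```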
Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>
Under systemd's Restart=always, --replace turns every restart into a
self-kill loop: the new instance reads gateway.pid, kills the previous
process, writes its own PID, and on the next restart the cycle repeats.
A process supervisor owns the lifecycle — --replace is for manual
one-shot takeovers and fights the supervisor.
Remove --replace from both the system-level and user-level systemd
ExecStart lines. The --replace flag stays available for manual
'hermes gateway run --replace' and on the macOS launchd fallback path
(#23387), which is a deliberate manual takeover, not a supervised unit.
Also drop RestartMaxDelaySec / RestartSteps from the templates — they
require systemd v255+ and are silently ignored on older versions. The
_strip_optional_systemd_directives normalizer stays so existing installs
whose on-disk unit still carries those directives aren't flagged as
outdated.
Credit: reported and diagnosed by @Skippy-the-Magnificent-one (PR #37145);
reimplemented here under project authorship because the original commit
was authored under a non-existent email.
* fix(nix): fix-lockfiles real-build verification + point auto-fix at nix/lib.nix
Two related fixes to the npm lockfile-hash tooling that, together, let a
broken nix build slip onto main and stay there:
1. fix-lockfiles trusted prefetch-npm-deps. It computes the hash from the
lockfile *contents* and early-exited "ok" whenever that matched the pin,
never running the real fetchNpmDeps + npmConfigHook build. Those two can
disagree (the --apply path already works around it), so `--check`
reported "ok" while a cold build was actually broken (e.g. lockfile
engines/os/cpu fields the pinned nixpkgs strips from the deps cache,
tripping npmConfigHook's consistency diff). Now, when prefetch says the
hash matches, confirm with `nix build .#<attr>` before believing it:
adopt the real fetchNpmDeps hash if nix reports a 'got:' mismatch,
surface non-hash failures honestly (exit 1) instead of claiming "ok",
and keep the transient-cache-failure skip.
2. nix-lockfile-fix.yml's auto-fix-main (and the PR-fix job) whitelisted and
staged nix/tui.nix + nix/web.nix, but the single npmDepsHash moved to
nix/lib.nix. So fix-lockfiles --apply edited nix/lib.nix, the guard
flagged it as an "unexpected modified file", and the job exited without
committing — the auto-healer could never push a fix. Point the guard
regex and both `git add` lines at nix/lib.nix.
* fix(nix): fix cold npm builds — adopt the deps-cache lockfile in patchPhase
hermes-tui/hermes-agent could not be built from source on the pinned nixpkgs:
prefetch-npm-deps strips advisory lockfile fields (engines/os/cpu/funding/
bin/…) that newer npm writes into package-lock.json, then npmConfigHook
byte-compares the source lockfile against the cache's stripped copy and fails
on the difference. CI only stayed green because it substitutes the prebuilt
hermes-tui from Cachix and never cold-builds it; anyone building cold (e.g. a
local path: input, or a cache miss) hit the failure.
mkNpmPassthru's patchPhase now copies the cache's own normalized
package-lock.json over the source before npmConfigHook runs, so the
consistency check is trivially satisfied. The resolved dependency set
(version/resolved/integrity/dependencies) is identical — fetchNpmDeps derived
the cache from this very lockfile — so `npm ci` installs the same tree; only
advisory metadata is dropped. Genuine drift is still caught by the
fixed-output npmDepsHash check, which runs before this phase.
Verified by cold-building .#tui and .#default (full hermes-agent) from scratch
on the pinned nixpkgs (6201e2) — both succeed where they previously failed at
npmConfigHook.
Completes the worktree-misroute fix from #35399, which made misroutes
visible (resolved_path) but did not prevent them: its divergence warning
only fired once a terminal command had populated the live cwd registry.
A fresh worktree session (registry still empty) with a stale TERMINAL_CWD='.'
got neither a worktree anchor nor a warning, so a relative write_file/patch
silently landed in the MAIN checkout.
Two changes in tools/file_tools.py:
- Treat sentinel TERMINAL_CWD values ('', '.', './', 'auto', 'cwd') and any
relative value as UNSET rather than a literal anchor. Previously '.' was
joined onto the process cwd, silently routing edits to wherever the process
happened to be (the main repo, in a worktree session). The gateway already
sanitizes the same set at import time; the file-tool layer now matches.
- New _authoritative_workspace_root(): prefers the live terminal cwd, else a
sentinel-free absolute TERMINAL_CWD (the worktree path cli.py/main.py set
for -w). _resolve_base_dir() and _path_resolution_warning() both use it, so
a worktree session resolves into — and warns about escaping — the worktree
from the very first write, before any cd has run.
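The sentinel handling and precedence can be sketched as follows. The function names are hypothetical simplifications of the helpers in tools/file_tools.py, and the sentinel set is the one listed above:

```python
import os

_SENTINELS = {"", ".", "./", "auto", "cwd"}


def usable_terminal_cwd(value):
    """Return value only if it is a real absolute anchor, else None."""
    if value is None or value in _SENTINELS or not os.path.isabs(value):
        return None  # treat sentinels and relative paths as unset
    return value


def workspace_root(live_cwd, terminal_cwd):
    """Prefer the live terminal cwd, else a sentinel-free TERMINAL_CWD."""
    return live_cwd or usable_terminal_cwd(terminal_cwd)
```

With an empty registry (live_cwd is None) and an absolute TERMINAL_CWD, the worktree path is returned, so the first relative write already has the right anchor.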
Validation: 11 new/parametrized tests (sentinel handling, empty-registry
anchoring, early divergence warning, live-cwd precedence). 32/32 pass under
scripts/run_tests.sh. Live E2E: relative write in an empty-registry worktree
session lands in the worktree, main untouched.
When --replace force-kills an unresponsive old gateway, SIGKILL can fail
to reap it (uninterruptible sleep, zombie-reaping parent, etc.). The old
code unconditionally cleared the PID file and scoped locks and started a
fresh instance anyway, leaving two live gateways fighting over the same
bot token — a duplicate-gateway failure mode of #19471.
Re-verify the process is actually gone (via the Windows-safe _pid_exists
helper) after the force-kill; if it still appears alive, clear the
takeover marker and abort the replacement instead of duplicating.
Co-authored-by: Hermes <noreply@nousresearch.com>
PR #41822 collapsed CWD-only overrides to the shared 'default' container
via _resolve_container_task_id, but three call sites kept routing the
*env/override lookup* through that collapsed id:
- the foreground exec path read _task_env_overrides[effective_task_id],
yet register_task_env_overrides writes under the raw task_id, so a
CWD-only override's cwd was silently dropped (env spun up at the wrong
root, exit 126);
- the get-or-create env lookup keyed solely on effective_task_id, so an
env cached under the raw task_id was missed and duplicated;
- register_task_env_overrides synced the new cwd onto the env under the
collapsed id, missing a live env cached under the raw task_id.
Container *identity* still collapses to 'default' (sharing preserved);
only the per-session env/override *lookup* now prefers the raw task_id and
falls back to the collapsed id. Fixes the 3 regressions in
test_terminal_task_cwd.py left red by #41822.
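The lookup order amounts to the sketch below. lookup_overrides is a hypothetical helper and the dict stands in for _task_env_overrides:

```python
def lookup_overrides(overrides, raw_task_id, collapsed_task_id):
    """Prefer the entry stored under the raw task id, then the collapsed id."""
    if raw_task_id in overrides:
        return overrides[raw_task_id]
    return overrides.get(collapsed_task_id, {})
```

Container identity can still be the collapsed id; only this per-session lookup changes.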
eslint --fix (import sort + padding-line-between-statements) on sidebar/index.tsx
after cherry-picking @dangelo352's commits; add release.py AUTHOR_MAP entry so
CI doesn't block on the unmapped author email.
gateway/run.py is the largest god file (20k LOC, GatewayRunner with 220
methods). This lifts the cohesive kanban-watcher cluster — _kanban_notifier_watcher,
_kanban_dispatcher_watcher, _kanban_advance/unsub/rewind, _deliver_kanban_artifacts
(~1,035 LOC, 6 methods) — into gateway/kanban_watchers.py as a mixin that
GatewayRunner inherits.
Mixin (not free functions) because the methods use only self state: inheriting
keeps every self._kanban_* call site working unchanged via the MRO, making this
a behavior-neutral move. The methods' lazy imports (_kb, _decomp, _load_config,
Platform) travel with them; the mixin needs only stdlib + a matching
logging.getLogger('gateway.run').
run.py 20187 -> 19157 LOC; GatewayRunner direct methods 220 -> 214.
Behavior-neutral: gateway test suite 6582 passed / 0 failed; start() still wires
both watchers via self._kanban_*; MRO resolves all 6 to the mixin. One test
(corrupt-board quarantine retry) keyed its time-travel mock on the caller's
filename being gateway/run.py — updated to also accept gateway/kanban_watchers.py.
Establishes the mixin-extraction pattern for further GatewayRunner decomposition
(the 2406-LOC _run_agent and 1164-LOC _handle_message remain, but their callback
closures need a context-object redesign — deferred).
When register_task_env_overrides is called with only a 'cwd' key
(ACP adapter workspace tracking), the task_id should collapse to
'default' so all interactive surfaces (TUI, gateway, dashboard)
share one long-lived container.
Previously, any override registration — even CWD-only — caused
_resolve_container_task_id to return the session key unchanged,
spinning up a separate container per session. This made it
impossible to authenticate into external services once and have
that auth available across all surfaces.
Now only overrides containing isolation keys (docker_image,
modal_image, singularity_image, daytona_image, env_type) trigger
per-task container isolation.
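The rule can be sketched as a small function. This resolve_container_task_id is a simplified hypothetical with an explicit overrides argument; the isolation keys are the ones listed above:

```python
ISOLATION_KEYS = frozenset(
    {"docker_image", "modal_image", "singularity_image", "daytona_image", "env_type"}
)


def resolve_container_task_id(task_id, overrides):
    """Keep a per-task container only when an isolation key is overridden."""
    if ISOLATION_KEYS & set(overrides):
        return task_id
    return "default"  # CWD-only (or empty) overrides share one container
```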
Fixes #37361
Subcommands whose handler was a closure defined inside main() — memory, acp,
tools, insights, skills, pairing, plugins, mcp, claw — have their handler
promoted to a top-level function and their parser block extracted into
hermes_cli/subcommands/<name>.py (build_<name>_parser, injected handler).
These 9 had zero closure-over-main-locals, so promotion is a pure relocation.
acp/mcp parser blocks use the shared add_accept_hooks_flag helper.
main() 1798 -> 954 LOC (71% below the 3297 Phase-2 starting point);
add_parser calls in main.py 89 -> 28.
Deferred: sessions, computer-use, secrets handlers reference <name>_parser
(for a no-subcommand print_help fallback) — left in place to avoid the
_self_parser indirection; minority, low value.
Behavior-neutral: all 9 subcommands' --help (incl nested subactions) byte-
identical to pre-extraction (diff-verified). tests/hermes_cli/ 6519 passed /
0 failed; new test_subcommands_followup.py covers the 9 builders.
run_conversation's inner retry loop tracked recovery state in ~15 scattered
bare booleans (per-provider OAuth refresh guards, format-recovery guards,
restart signals). They are now fields on a single TurnRetryState dataclass the
loop mutates in place (_retry.<flag>), giving the recovery bookkeeping a named,
testable home.
Loop-control vars (retry_count, max_retries, max_compression_attempts) stay as
plain locals — they're while-mechanics, not recovery bookkeeping.
Behavior-neutral: pure local→attribute rewrite of 42 references; kwarg NAMES
preserved (e.g. has_retried_429=_retry.has_retried_429). Live simple + tool
turns OK.
Validation: tests/run_agent/ 1615 passed / 0 failed under per-file process
isolation; new test_turn_retry_state.py pins the field contract.
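For illustration, the shape of such a dataclass might be the following. Only has_retried_429 is named in this message; the other two fields are hypothetical placeholders for the remaining guards:

```python
from dataclasses import dataclass, fields


@dataclass
class TurnRetryState:
    """Per-turn recovery flags, mutated in place by the retry loop."""

    has_retried_429: bool = False
    format_recovery_attempted: bool = False  # hypothetical field name
    restart_requested: bool = False  # hypothetical field name


def fresh_state_is_clean(state):
    """A new turn should start with every recovery flag unset."""
    return all(getattr(state, f.name) is False for f in fields(state))
```

A test can then pin the field contract by asserting on fields(TurnRetryState) instead of hunting for scattered locals.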