Commit graph

10990 commits

Author SHA1 Message Date
Gille
039fbb41fc
fix(desktop): show newly configured model providers (#41545) 2026-06-08 01:39:37 -07:00
floory
15c99b437f
fix(cli): set PYTHON env for node-gyp native builds on NixOS (#40690)
* fix(cli): set PYTHON env for node-gyp native builds on NixOS

node-gyp (triggered by node-pty during npm ci) looks for python3 on
PATH, which fails on NixOS because python3 lives in the nix store and
is not on the system PATH.

Add _nixos_build_env() — a two-tier helper that detects NixOS and:
1. Fast path: hermes venv python3 (~0s)
2. Fallback: nix-shell which python3 (~2-5s)

Wire it into _run_npm_install_deterministic() via a new env= parameter,
then pass it through cmd_gui() and _update_node_dependencies().

Non-NixOS systems: _nixos_build_env() returns None, behavior unchanged.

* fix(cli): merge _nixos_build_env() with os.environ, fix NixOS detection, add explicit return None

- Critical fix: both Tier 1 (venv) and Tier 2 (nix-shell) now return
  {**os.environ, "PYTHON": ...} instead of {"PYTHON": ...} — subprocess.run
  with env= replaces the entire environment, so the old code wiped PATH
  and broke npm/node on NixOS entirely.
- Uses re.search(r"^ID=nixos$", ...) for anchored NixOS detection instead
  of unanchored substring match (could match ID_LIKE=...nixos).
- Removes redundant Path.exists() guard before read_text(); just catches
  OSError (one filesystem read instead of two).
- Adds explicit return None at end of function for type-hint consistency.
2026-06-08 13:57:37 +05:30
teknium1
7a5827c8b0 test: repoint percentage-clamp source guard to gateway/slash_commands.py
test_gateway_run_clamped read gateway/run.py asserting the /usage stats handler
clamps pct with min(100, ...). That handler moved to gateway/slash_commands.py
in this PR's extraction; repoint the guard so it still fires on clamp removal.

tests/run_agent/ + tests/gateway/ 8024 passed / 0 failed.
2026-06-08 01:25:35 -07:00
teknium1
de5fe2fa7d test(gateway): repoint slash-command mocks after mixin extraction
Tests for the extracted handlers mocked symbols at gateway.run.*; the handlers
now resolve top-level-imported deps (atomic_json_write, fetch_account_usage,
render_account_usage_lines) and __file__ from gateway.slash_commands. Repoint
those mocks. run.py-resident methods (_increment_restart_failure_counts,
_clear_restart_failure_count) keep their gateway.run.atomic_json_write mock —
only the moved handlers' mocks change.

tests/gateway/ 6415 passed / 0 failed.
2026-06-08 01:25:35 -07:00
teknium1
619bd78273 refactor(gateway): extract 42 slash-command handlers into GatewaySlashCommandsMixin (god-file Phase 3b)
The in-session slash commands (/model, /reset, /usage, /compress, /voice, ...)
— 42 _handle_*_command handlers, ~3,200 LOC — move out of gateway/run.py into a
mixin GatewayRunner inherits. self._handle_*_command dispatch + all test
references resolve unchanged via the MRO.

Neutral deps (MessageEvent, EphemeralReply, Platform, t, cfg_get, atomic_*_write,
account-usage helpers, stdlib) imported at the mixin top level. The ~10 run.py-
internal helpers (_hermes_home, _load_gateway_config, _resolve_gateway_model,
_AGENT_PENDING_SENTINEL, ...) imported lazily inside the handlers that need them
to avoid an import cycle.

gateway/run.py 19157 -> 15870 LOC; GatewayRunner direct methods 214 -> 172.

Behavior-neutral: voice/update/model/compress command test suites pass; all 42
resolve to the mixin via MRO.
2026-06-08 01:25:35 -07:00
teknium1
02a4d66951 fix(auxiliary): retry transient transport error once before fallback (#16587)
Some checks failed
Deploy Site / deploy-vercel (push) Waiting to run
Deploy Site / deploy-docs (push) Waiting to run
Docker Build and Publish / build-amd64 (push) Waiting to run
Docker Build and Publish / build-arm64 (push) Waiting to run
Docker Build and Publish / merge (push) Blocked by required conditions
Lint (ruff + ty) / ruff + ty diff (push) Waiting to run
Lint (ruff + ty) / ruff enforcement (blocking) (push) Waiting to run
Lint (ruff + ty) / Windows footguns (blocking) (push) Waiting to run
Nix / nix (macos-latest) (push) Waiting to run
Nix / nix (ubuntu-latest) (push) Waiting to run
Tests / test (1) (push) Waiting to run
Tests / test (2) (push) Waiting to run
Tests / test (3) (push) Waiting to run
Tests / test (4) (push) Waiting to run
Tests / test (5) (push) Waiting to run
Tests / test (6) (push) Waiting to run
Tests / save-durations (push) Blocked by required conditions
Tests / e2e (push) Waiting to run
Nix Lockfile Fix / auto-fix-main (push) Has been cancelled
Nix Lockfile Fix / fix (push) Has been cancelled
A one-off transient transport failure (streaming-close / incomplete
chunked read / 5xx / 408) on an auxiliary LLM call escalated straight to
provider/model fallback (or, for context compression, dropped the summary
and entered cooldown), even when an immediate retry on the same provider
would have succeeded.

Add a single same-target retry at the top of call_llm() and
async_call_llm() — before the existing except-chain — gated on a new
_is_transient_transport_error() that reuses the canonical
_is_connection_error() detector plus a 5xx/408 status check. A second
failure (or any non-transient error: auth, other 4xx, malformed payload)
falls through to first_err and the existing fallback handling unchanged.

This lives in call_llm so every auxiliary task (compression, memory flush,
title generation, session search, vision) shares one transient-retry
surface, rather than each caller re-implementing it. The context
compressor needs no change — it calls call_llm and inherits the retry; its
existing fallback-to-main path (#18458) now composes naturally (retry the
aux model once, then fall back to main only if the retry also fails).

Co-authored-by: ARegalado1 <alberto.regalado@ymail.com>
2026-06-08 01:05:45 -07:00
kshitij
4107076128
Merge pull request #41155 from kshitijk4poor/fix/cli-modal-direct-invalidate-41098
fix(cli): paint approval/clarify/sudo/secret modal prompts directly, not via the throttle (#41098)
2026-06-08 01:01:51 -07:00
Teknium
4d18717b6c
fix(gateway): drop --replace from systemd unit templates (#41892)
Under systemd's Restart=always, --replace turns every restart into a
self-kill loop: the new instance reads gateway.pid, kills the previous
process, writes its own PID, and on the next restart the cycle repeats.
A process supervisor owns the lifecycle — --replace is for manual
one-shot takeovers and fights the supervisor.

Remove --replace from both the system-level and user-level systemd
ExecStart lines. The --replace flag stays available for manual
'hermes gateway run --replace' and on the macOS launchd fallback path
(#23387), which is a deliberate manual takeover, not a supervised unit.

Also drop RestartMaxDelaySec / RestartSteps from the templates — they
require systemd v255+ and are silently ignored on older versions. The
_strip_optional_systemd_directives normalizer stays so existing installs
whose on-disk unit still carries those directives aren't flagged as
outdated.

Credit: reported and diagnosed by @Skippy-the-Magnificent-one (PR #37145);
reimplemented here under project authorship because the original commit
was authored under a non-existent email.
2026-06-08 00:20:08 -07:00
Siddharth Balyan
d02a59b679
fix(nix): cold npm builds + fix-lockfiles real-build verification + auto-fix workflow (#41867)
* fix(nix): fix-lockfiles real-build verification + point auto-fix at nix/lib.nix

Two related fixes to the npm lockfile-hash tooling that, together, let a
broken nix build slip onto main and stay there:

1. fix-lockfiles trusted prefetch-npm-deps. It computes the hash from the
   lockfile *contents* and early-exited "ok" whenever that matched the pin,
   never running the real fetchNpmDeps + npmConfigHook build. Those two can
   disagree (the --apply path already works around it), so `--check`
   reported "ok" while a cold build was actually broken (e.g. lockfile
   engines/os/cpu fields the pinned nixpkgs strips from the deps cache,
   tripping npmConfigHook's consistency diff). Now, when prefetch says the
   hash matches, confirm with `nix build .#<attr>` before believing it:
   adopt the real fetchNpmDeps hash if nix reports a 'got:' mismatch,
   surface non-hash failures honestly (exit 1) instead of claiming "ok",
   and keep the transient-cache-failure skip.

2. nix-lockfile-fix.yml's auto-fix-main (and the PR-fix job) whitelisted and
   staged nix/tui.nix + nix/web.nix, but the single npmDepsHash moved to
   nix/lib.nix. So fix-lockfiles --apply edited nix/lib.nix, the guard
   flagged it as an "unexpected modified file", and the job exited without
   committing — the auto-healer could never push a fix. Point the guard
   regex and both `git add` lines at nix/lib.nix.

* fix(nix): fix cold npm builds — adopt the deps-cache lockfile in patchPhase

hermes-tui/hermes-agent could not be built from source on the pinned nixpkgs:
prefetch-npm-deps strips advisory lockfile fields (engines/os/cpu/funding/
bin/…) that newer npm writes into package-lock.json, then npmConfigHook
byte-compares the source lockfile against the cache's stripped copy and fails
on the difference. CI only stayed green because it substitutes the prebuilt
hermes-tui from Cachix and never cold-builds it; anyone building cold (e.g. a
local path: input, or a cache miss) hit the failure.

mkNpmPassthru's patchPhase now copies the cache's own normalized
package-lock.json over the source before npmConfigHook runs, so the
consistency check is trivially satisfied. The resolved dependency set
(version/resolved/integrity/dependencies) is identical — fetchNpmDeps derived
the cache from this very lockfile — so `npm ci` installs the same tree; only
advisory metadata is dropped. Genuine drift is still caught by the
fixed-output npmDepsHash check, which runs before this phase.

Verified by cold-building .#tui and .#default (full hermes-agent) from scratch
on the pinned nixpkgs (6201e2) — both succeed where they previously failed at
npmConfigHook.
2026-06-08 12:41:37 +05:30
Teknium
e45b745835
fix(file-tools): reject sentinel TERMINAL_CWD; anchor worktree edits before live cwd exists (#41861)
Completes the worktree-misroute fix from #35399, which made misroutes
visible (resolved_path) but did not prevent them: its divergence warning
only fired once a terminal command had populated the live cwd registry.
A fresh worktree session (registry still empty) with a stale TERMINAL_CWD='.'
got neither a worktree anchor nor a warning, so a relative write_file/patch
silently landed in the MAIN checkout.

Two changes in tools/file_tools.py:
- Treat sentinel TERMINAL_CWD values ('', '.', './', 'auto', 'cwd') and any
  relative value as UNSET rather than a literal anchor. Previously '.' was
  joined onto the process cwd, silently routing edits to wherever the process
  happened to be (the main repo, in a worktree session). The gateway already
  sanitizes the same set at import time; the file-tool layer now matches.
- New _authoritative_workspace_root(): prefers the live terminal cwd, else a
  sentinel-free absolute TERMINAL_CWD (the worktree path cli.py/main.py set
  for -w). _resolve_base_dir() and _path_resolution_warning() both use it, so
  a worktree session resolves into — and warns about escaping — the worktree
  from the very first write, before any cd has run.

Validation: 11 new/parametrized tests (sentinel handling, empty-registry
anchoring, early divergence warning, live-cwd precedence). 32/32 pass under
scripts/run_tests.sh. Live E2E: relative write in an empty-registry worktree
session lands in the worktree, main untouched.
2026-06-07 23:58:47 -07:00
LeonSGP43
e02f4c03c3 fix(gateway): abort --replace when old PID survives SIGKILL
When --replace force-kills an unresponsive old gateway, SIGKILL can fail
to reap it (uninterruptible sleep, zombie-reaping parent, etc.). The old
code unconditionally cleared the PID file and scoped locks and started a
fresh instance anyway, leaving two live gateways fighting over the same
bot token — a duplicate-gateway failure mode of #19471.

Re-verify the process is actually gone (via the Windows-safe _pid_exists
helper) after the force-kill; if it still appears alive, clear the
takeover marker and abort the replacement instead of duplicating.

Co-authored-by: Hermes <noreply@nousresearch.com>
2026-06-07 23:57:32 -07:00
konsisumer
3714caa1b9 fix(session): follow compression continuations for transcript reads 2026-06-07 23:57:20 -07:00
teknium1
329c33dac3 fix(terminal): read cwd overrides under raw task_id after container collapse
PR #41822 collapsed CWD-only overrides to the shared 'default' container
via _resolve_container_task_id, but three call sites kept routing the
*env/override lookup* through that collapsed id:

  - the foreground exec path read _task_env_overrides[effective_task_id],
    yet register_task_env_overrides writes under the raw task_id, so a
    CWD-only override's cwd was silently dropped (env spun up at the wrong
    root, exit 126);
  - the get-or-create env lookup keyed solely on effective_task_id, so an
    env cached under the raw task_id was missed and duplicated;
  - register_task_env_overrides synced the new cwd onto the env under the
    collapsed id, missing a live env cached under the raw task_id.

Container *identity* still collapses to 'default' (sharing preserved);
only the per-session env/override *lookup* now prefers the raw task_id and
falls back to the collapsed id. Fixes the 3 regressions in
test_terminal_task_cwd.py left red by #41822.
2026-06-07 23:44:04 -07:00
teknium1
d759c13c09 chore(salvage): lint fix + AUTHOR_MAP for desktop source-folders PR #40272
eslint --fix (import sort + padding-line-between-statements) on sidebar/index.tsx
after cherry-picking @dangelo352's commits; add release.py AUTHOR_MAP entry so
CI doesn't block on the unmapped author email.
2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
694adec635 Smooth desktop sidebar drag sorting 2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
f0fcaa1e54 Preserve dragged order inside source folders 2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
0f500fc41d Render grouped sessions when local list is empty 2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
3fc67b7333 Persist desktop sidebar drag order 2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
ede4f5a4a3 Show messaging source folders in desktop sessions 2026-06-07 23:44:04 -07:00
D'Angelo Rodriguez
9d6992ee8a Show platform sources in desktop sessions 2026-06-07 23:44:04 -07:00
teknium1
1c68f6f81f refactor(gateway): extract kanban watcher loops into GatewayKanbanWatchersMixin (god-file Phase 3)
gateway/run.py is the largest god file (20k LOC, GatewayRunner with 220
methods). This lifts the cohesive kanban-watcher cluster — _kanban_notifier_watcher,
_kanban_dispatcher_watcher, _kanban_advance/unsub/rewind, _deliver_kanban_artifacts
(~1,035 LOC, 6 methods) — into gateway/kanban_watchers.py as a mixin that
GatewayRunner inherits.

Mixin (not free functions) because the methods use only self state: inheriting
keeps every self._kanban_* call site working unchanged via the MRO, making this
a behavior-neutral move. The methods' lazy imports (_kb, _decomp, _load_config,
Platform) travel with them; the mixin needs only stdlib + a matching
logging.getLogger('gateway.run').

run.py 20187 -> 19157 LOC; GatewayRunner direct methods 220 -> 214.

Behavior-neutral: gateway test suite 6582 passed / 0 failed; start() still wires
both watchers via self._kanban_*; MRO resolves all 6 to the mixin. One test
(corrupt-board quarantine retry) keyed its time-travel mock on the caller's
filename being gateway/run.py — updated to also accept gateway/kanban_watchers.py.

Establishes the mixin-extraction pattern for further GatewayRunner decomposition
(the 2406-LOC _run_agent and 1164-LOC _handle_message remain, but their callback
closures need a context-object redesign — deferred).
2026-06-07 23:14:18 -07:00
liuhao1024
6459b3d991 fix(terminal): collapse CWD-only overrides to shared container
When register_task_env_overrides is called with only a 'cwd' key
(ACP adapter workspace tracking), the task_id should collapse to
'default' so all interactive surfaces (TUI, gateway, dashboard)
share one long-lived container.

Previously, any override registration — even CWD-only — caused
_resolve_container_task_id to return the session key unchanged,
spinning up a separate container per session. This made it
impossible to authenticate into external services once and have
that auth available across all surfaces.

Now only overrides containing isolation keys (docker_image,
modal_image, singularity_image, daytona_image, env_type) trigger
per-task container isolation.

Fixes #37361
2026-06-07 23:04:54 -07:00
teknium1
1a626470ca refactor(cli): promote 9 closure handlers to top-level + extract their parsers (god-file Phase 2 follow-up)
Subcommands whose handler was a closure defined inside main() — memory, acp,
tools, insights, skills, pairing, plugins, mcp, claw — have their handler
promoted to a top-level function and their parser block extracted into
hermes_cli/subcommands/<name>.py (build_<name>_parser, injected handler).

These 9 had zero closure-over-main-locals, so promotion is a pure relocation.
acp/mcp parser blocks use the shared add_accept_hooks_flag helper.

main() 1798 -> 954 LOC (71% below the 3297 Phase-2 starting point);
add_parser calls in main.py 89 -> 28.

Deferred: sessions, computer-use, secrets handlers reference <name>_parser
(for a no-subcommand print_help fallback) — left in place to avoid the
_self_parser indirection; minority, low value.

Behavior-neutral: all 9 subcommands' --help (incl nested subactions) byte-
identical to pre-extraction (diff-verified). tests/hermes_cli/ 6519 passed /
0 failed; new test_subcommands_followup.py covers the 9 builders.
2026-06-07 22:56:23 -07:00
teknium1
524453dab5 refactor(agent): consolidate inner-retry-loop recovery flags into TurnRetryState (god-file Phase 1b)
run_conversation's inner retry loop tracked recovery state in ~15 scattered
bare booleans (per-provider OAuth refresh guards, format-recovery guards,
restart signals). They are now fields on a single TurnRetryState dataclass the
loop mutates in place (_retry.<flag>), giving the recovery bookkeeping a named,
testable home.

Loop-control vars (retry_count, max_retries, max_compression_attempts) stay as
plain locals — they're while-mechanics, not recovery bookkeeping.

Behavior-neutral: pure local→attribute rewrite of 42 references; kwarg NAMES
preserved (e.g. has_retried_429=_retry.has_retried_429). Live simple + tool
turns OK.

Validation: tests/run_agent/ 1615 passed / 0 failed under per-file process
isolation; new test_turn_retry_state.py pins the field contract.
2026-06-07 22:42:05 -07:00
teknium1
4d926f248d chore(release): add AUTHOR_MAP entry for rodboev 2026-06-07 22:39:51 -07:00
Rod Boev
648706936d test(gateway): add compression session_id rotation integration tests (#34089) 2026-06-07 22:39:51 -07:00
teknium1
39c4ac3af1 chore(release): add AUTHOR_MAP entry for JimStenstrom 2026-06-07 22:30:02 -07:00
JimStenstrom
cb5c24e37d fix(agent): sync logging session context on compaction id rotation
When context compaction rotates agent.session_id, it updates the gateway/tools
session context (set_current_session_id -> HERMES_SESSION_ID env + ContextVar)
but never updates the separate logging session context. The [session_id] tag on
log lines comes from hermes_logging._session_context (set once per turn in
conversation_loop.py), so post-compaction log lines in the same turn carry the
STALE old id while the message/DB/gateway state carry the new one — breaking log
correlation exactly at the compaction boundary.

Call hermes_logging.set_session_context(agent.session_id) alongside the existing
set_current_session_id, guarded so a logging failure can't regress the routing
update. Logs-only; no runtime or caching impact.

Refs #34089
2026-06-07 22:30:02 -07:00
Teknium
8e223b36ed
fix(curator): protect load-bearing built-in skills from archival/consolidation (#41817)
The curator's idle-archival path (apply_automatic_transitions under
prune_builtins) could archive the bundled `plan` skill, killing the
/plan slash command silently — typing /plan then returned 'Unknown
command' with no signal that a skill had vanished. The archived skill's
hash stays in .bundled_manifest, so 'hermes update' wouldn't re-seed it.

Add PROTECTED_BUILTIN_SKILLS ({plan}) enforced at the master gate
is_curation_eligible() (covers archive_skill + the transition walk) and
in the candidate enumerator (so the LLM consolidation pass never sees
them). Immune to prune_builtins, pin state, and LLM judgment.
2026-06-07 22:23:29 -07:00
Teknium
777dc9da62
feat(acp): emit session provenance metadata for compression rotation (#41724)
Closes #33617. Adds additive _meta.hermes.sessionProvenance to ACP session
surfaces so clients can detect compression-driven internal session rotation
without parsing status text, guessing from token drops, or reading state.db.

Derived on demand from the existing compression chain (parent_session_id /
end_reason) — no new persisted state, no schema change, no ACP protocol change.
ACP session_id stays the stable client handle.

- acp_adapter/provenance.py: derive provenance from SessionDB
- server.py: attach _meta to new/load/resume responses; emit a
  session_info_update when the internal head rotates during a prompt
2026-06-07 22:22:21 -07:00
teknium1
240c5d4543 chore: map martin.alca@gmail.com -> draix in AUTHOR_MAP
Salvage follow-up for PR #33221 — the cherry-picked commit is authored
under martin.alca@gmail.com (not the draixagent@gmail.com already mapped),
which would fail the CI author-attribution gate.
2026-06-07 22:22:01 -07:00
Martín Alcalá Rubí
132d6fe6d6 fix(volcengine): strip XML attribute fragments from tool_use.name (#33007)
VolcEngine's api/plan endpoint occasionally leaks raw XML attribute
fragments into tool_use.name when its protocol-translation layer
converts the model's native XML-style tool emission to Anthropic
Messages tool_use blocks, producing names like:

  terminal" parameter="command" string="true
  execute_code" parameter="code" string="true
  session_search" parameter="session_id" string="true

The corruption happens server-side at the provider, but it breaks
every tool call for affected users — no normalization rule in
repair_tool_call can rescue them, so each request runs through three
retries and then aborts as partial.

Add an early sanitizer in agent_runtime_helpers.repair_tool_call that
trims at the first ' " ', " ' ", '<', or '>' character (idx > 0
only) so the rest of the existing repair pipeline (lowercase /
snake_case / fuzzy match) can resolve the cleaned name normally.

Whitespace is deliberately NOT a separator — the legitimate
"write file" -> write_file repair path (covered by
test_space_to_underscore) must keep working.

Tests: 11 new regression cases in TestVolcEngineXmlPollution
covering all three observed polluted names, CamelCase + pollution
mix, single-quote variants, angle-bracket variants, clean-name
passthrough, and the whitespace-preservation guard. All 18 pre-
existing repair tests still pass (29 total in the file).
2026-06-07 22:22:01 -07:00
teknium1
f5bd09af4b refactor(acp): share interrupt-sentinel prefix, simplify guard
Replace the ACP-local prefix/suffix matcher + helper with a single
startswith() check against INTERRUPT_WAITING_FOR_MODEL_PREFIX, now
defined once in conversation_loop.py where the sentinel is produced.
Keeps the source of truth in one place so the guard cannot drift if
the status string changes. Net -17 LOC in server.py.

Also add lsaether to release.py AUTHOR_MAP.
2026-06-07 22:20:43 -07:00
lsaether
9b631e4ae1 fix(acp): suppress cancel interrupt sentinel 2026-06-07 22:20:43 -07:00
Teknium
2789bf4e25 fix(auxiliary): route Codex Responses path through shared converter (#5709)
The auxiliary Codex adapter maintained its own chat->Responses conversion
loop that forwarded every non-system message's role verbatim into
Responses input[]. When flush_memories()/compression replayed session
history containing assistant tool_calls + role=tool results, those tool
messages leaked into the request and the Responses API rejected them with
HTTP 400: Invalid value: 'tool'.

Route _CodexCompletionsAdapter.create() through the same shared converter
the main agent transport uses (_chat_messages_to_responses_input), so tool
calls become function_call items and tool results become function_call_output
items with a valid call_id. Single conversion path means no future drift.

Also remove the now-dead _convert_content_for_responses() helper — its only
caller was the private conversion loop this change deletes.

Co-authored-by: ProgramCaiCai <techxacm@gmail.com>
2026-06-07 22:18:31 -07:00
teknium1
568e127612 refactor(cli): extract 25 more subcommand parsers into hermes_cli/subcommands/
Batch extraction of every remaining subcommand whose handler is top-level and
whose parser block is pure argparse: model, setup, postinstall, whatsapp, slack,
login, logout, auth, status, webhook, hooks, doctor, security, dump, debug,
backup, import, config, version, update, uninstall, dashboard, gui, logs,
prompt-size.

Each becomes hermes_cli/subcommands/<name>.py with build_<name>_parser() and an
injected handler (no main import). dashboard also injects cmd_dashboard_register
for its nested 'register' action.

Behavior-neutral: all 25 subcommands' --help output (and nested subaction help)
diff-verified byte-identical to pre-extraction. Two RawDescriptionHelpFormatter
epilogs (debug, logs) needed their multi-line string interiors preserved at
column 0 — caught by the --help diff, not compile.

main() 3297 -> 1798 LOC across this PR; add_parser calls in main.py 179 -> 89.

Validation: tests/hermes_cli/ 6476 passed / 0 failed under per-file process
isolation; new test_subcommands_batch.py smoke-tests all 25 builders + the
dashboard two-handler case.
2026-06-07 22:18:14 -07:00
teknium1
4da45e8727 refactor(cli): extract profile + gateway/proxy parsers into hermes_cli/subcommands/
Follow-on to the cron extraction in the same Phase 2 PR. Same pattern:
per-group build_<name>_parser() functions with injected handlers, no main
import.

- subcommands/profile.py: build_profile_parser (190-line block out of main()).
- subcommands/gateway.py: build_gateway_parser (gateway + proxy, 238-line block;
  they shared one inline section). Imports argparse for SUPPRESS defaults.
- main(): two more inline blocks become single builder calls.

Behavior-neutral: 'profile [sub] --help' and 'gateway/proxy [sub] --help'
byte-identical to pre-extraction (diff-verified).

main() now 2723 LOC (was 3297 at Phase 2 start); add_parser calls in main.py
179 -> 141.

Validation: tests/hermes_cli/ 6476 passed / 0 failed under per-file process
isolation; new builder unit tests cover subactions, aliases, dispatch, flags.
2026-06-07 22:18:14 -07:00
teknium1
b2e6053243 refactor(cli): extract hermes cron parser into hermes_cli/subcommands/ (god-file Phase 2)
Phase 2 of the god-file decomposition plan. main()'s argparse tree is 179
inline add_parser calls in one 3,297-line function. This establishes the
hermes_cli/subcommands/ package and extracts the first group (cron) as the
proof-of-pattern:

- hermes_cli/subcommands/_shared.py: shared parser helpers (add_accept_hooks_flag),
  re-exported from main.py for backwards compat.
- hermes_cli/subcommands/cron.py: build_cron_parser(subparsers, cmd_cron=...).
  Handler injected so the module never imports main (cycle avoidance).
- main()'s ~155-line inline cron block becomes one build_cron_parser() call.

Behavior-neutral: 'hermes cron create --help' output is byte-identical to
origin/main. main() 3297 -> 3143 LOC.

Validation: tests/hermes_cli/ 6466 passed / 0 failed under per-file process
isolation; new test_subcommands_cron.py covers subactions, aliases, options,
no-agent tristate, injected dispatch, and --accept-hooks.
2026-06-07 22:18:14 -07:00
teknium1
54870847cb refactor(agent): extract run_conversation prologue into agent/turn_context.py
Phase 1 of the god-file decomposition plan. run_conversation's ~470-line
once-per-turn setup block (stdio guarding, retry-counter resets, user-message
sanitization, todo/nudge hydration, system-prompt restore-or-build,
crash-resilience persistence, preflight compression, the pre_llm_call hook, and
external-memory prefetch) is moved verbatim into build_turn_context(), which
returns a TurnContext dataclass the loop unpacks.

Behavior-neutral move-and-name refactor: the builder mutates `agent` exactly as
the inline code did; only the locals the loop reads back are returned.

- run_conversation: 4602 -> 4217 LOC (-385)
- agent/conversation_loop.py: 4965 -> ~4580 LOC
- new agent/turn_context.py: focused, dependency-injected, unit-tested in isolation

Tests: tests/run_agent/ 1570 passed / 0 failed under per-file process isolation.
Relocation follow-ups: 413_compression mocks now patch both module references;
nudge/on_turn_start source-inspection guards point at the extracted module.
2026-06-07 22:17:35 -07:00
Teknium
86c537d209
fix(memory): instruct in-turn consolidation + retry on overflow (#41755)
* fix(memory): make overflow errors instruct in-turn consolidation + retry

When bounded memory is full, the add/replace overflow errors now explicitly
tell the model to consolidate (merge/remove/shorten) and retry the write in
the same turn, matching the documented behavior. The replace-overflow path
now also echoes current_entries + usage for parity with add-overflow, so the
model has the same context to act on.

Closes #23378 (working-as-documented; this sharpens runtime to match docs).

* fix(memory): broaden overflow remediation hint beyond 'stale'

Say 'stale or less important' — entries don't have to be stale to be the
right ones to drop when making room.
2026-06-07 22:16:28 -07:00
teknium1
2a10da3a16 fix(gateway): keep /model + /reasoning overrides on topic recovery & compression splits
Session-scoped /model and /reasoning overrides were silently lost on
Telegram DM/forum topics and after compression session splits (#30479).

Root cause: _handle_message_with_agent rewrites source.thread_id via
_recover_telegram_topic_thread_id (lobby/stripped reply -> the user's
bound topic) before deriving the session key. The /model and /reasoning
handlers derived their override key from the raw inbound event.source,
skipping that recovery, so the override was stored under one key and the
next message turn read a different key.

Fix: add _normalize_source_for_session_key (applies the same recovery a
message turn does) and use it in both handlers before deriving the key.
session_id rotation on compression was never the cause — overrides are
keyed by the durable session_key; the split path preserves it.

Author: teknium1 <127238744+teknium1@users.noreply.github.com>
2026-06-07 22:10:32 -07:00
Hariharan Ayappane
b8469a81e3 fix(weixin): add rate-limit circuit breaker 2026-06-07 22:10:17 -07:00
Teknium
2e62862784
fix(telegram): use get_running_loop in polling-conflict retry reschedule (#41716)
The conflict-retry path called asyncio.get_event_loop() to reschedule
itself when a retry's start_polling raised. On Python 3.11+ (our floor)
that raises 'RuntimeError: There is no current event loop in thread
MainThread' when no loop is attached to the thread, which is what
happens when PTB dispatches this error callback. The retry never gets
scheduled, the adapter goes silent-but-alive, and gateway --replace
keeps spawning fresh instances that hit the same wall — the crash loop
reported in #19471 (worse under multi-profile, where two bots hold the
same conflict open).

We are inside a coroutine here, so asyncio.get_running_loop() is the
correct, guaranteed-valid replacement. Only get_event_loop() call in
any platform adapter, so no sibling sites.

Fixes #19471
2026-06-07 22:10:03 -07:00
teknium1
b5f7a1f299 chore(release): add basilalshukaili to AUTHOR_MAP 2026-06-07 22:09:45 -07:00
dusterbloom
cca3b77a4b fix(compression): clear _previous_summary on session end (defense-in-depth)
ContextCompressor inherited a no-op on_session_end() from ContextEngine, so
per-session iterative-summary state (_previous_summary) survived a real session
boundary on a reused compressor instance. Override it to clear the summary the
moment the owning session ends, complementing the point-of-use guard in
compress(). Closes the cross-session contamination path in #38788.

Co-authored-by: dusterbloom <32869278+dusterbloom@users.noreply.github.com>
2026-06-07 22:09:45 -07:00
Basil Al Shukaili
8513a6aec7 fix(compression): guard against cross-session stale _previous_summary contamination
When a cron or background session compacts, it sets _previous_summary for
iterative updates. If that session ends without /new or /reset (which calls
on_session_reset()), the stale summary survives on the ContextCompressor
instance. A subsequent live messaging session's compaction then injects it as
'PREVIOUS SUMMARY:' into the summarizer prompt — contaminating the live
session with unrelated content from the prior session.

Add an else guard in compress(): when no handoff summary is found in the
current messages but _previous_summary is non-empty, discard it so
_generate_summary() starts fresh instead of iteratively updating a stale
cross-session summary.

Fixes #38788
2026-06-07 22:09:45 -07:00
Teknium
ad8e57793d
fix(hermes_time): implement reset_cache() referenced in docstrings (#41728)
The module docstring and get_timezone()/cache comments documented a
reset_cache() helper for forcing tz re-resolution after config changes,
but the function was never defined — doc-followers calling it hit
AttributeError. Adds the helper to clear the cached tz state.

Surfaced in #32043.
2026-06-07 22:08:01 -07:00
Teknium
5408013369
fix(gateway): isolate DM sessions on user_id when chat_id is absent (#41764)
build_session_key collapsed every DM that arrived without a chat_id into
one shared 'agent:main:<platform>:dm' key. A single cached AIAgent then
served multiple users' conversations, bleeding history across senders.

DMs now fall back to the sender's user_id_alt/user_id (mirroring the
group-path participant precedence and the telegram auth-path fallback)
before the bare per-platform sink. Telegram's normal event path always
sets chat_id, so this hardens the synthetic-source / non-standard-adapter
paths that don't.
2026-06-07 22:07:07 -07:00
Teknium
a77bc2c08d
fix(compression): disable compression on background-review fork to prevent cross-turn stale-parent fork (#41708)
The per-session compression lock prevents same-window concurrent forks but
not cross-turn ones: the background-review fork shares the parent's
session_id, so if it won a compression race its new child session was never
adopted by the gateway (the fork is single-lifecycle). The next foreground
turn then started from the stale parent and compressed it again, leaving the
same parent with two sibling children.

Set review_agent.compression_enabled = False so the fork never triggers
compression. Both trigger sites in conversation_loop.py gate on
compression_enabled before calling _compress_context, so the fork can never
rotate the shared parent. Review needs full context anyway — compressing
would degrade the memory/skill summary.

The per-session lock is kept as defense-in-depth for any future shared-session
path. Adds a regression test that fails without the flag and passes with it.

Closes #38727
2026-06-07 22:06:48 -07:00
Teknium
48ae8029aa
fix(delegate): resolve custom-endpoint subagent pools by endpoint identity (#41730)
Subagents delegated to a custom endpoint were misrouted when the parent
ran on a different custom endpoint. Both runtimes collapse to
provider="custom", so _resolve_child_credential_pool() treated them as
interchangeable and handed the child the parent's pool. Leasing from it
then overwrote the child's delegated base_url with the parent's endpoint
via _swap_credential() — the child sent the delegated model name to the
wrong endpoint.

Custom runtimes now resolve by endpoint identity (the custom:<name> pool
key derived from base_url). The parent pool is reused only when both
parent and child resolve to the same custom endpoint; unregistered raw
endpoints return None so the child keeps its fixed delegated credential.
Non-custom provider paths are unchanged.

Fixes #7833.
2026-06-07 22:05:14 -07:00