Commit graph

2906 commits

Author SHA1 Message Date
kshitijk4poor
b577f25100 refactor(gateway): dedupe drain-timeout resolution + share active_agents parse
Follow-up cleanups on top of the busy/idle readout (PR #50103):

- web_server.py /api/status reused the single drain-timeout resolver
  hermes_cli.gateway._get_restart_drain_timeout() (HERMES_RESTART_DRAIN_TIMEOUT
  env -> agent.restart_drain_timeout config -> default) instead of inlining a
  third hand-rolled copy of that precedence chain. Also fixes a subtle
  divergence: the inline copy used os.environ.get() so a set-but-empty env var
  was treated as a value rather than falling through to config; the shared
  resolver .strip()s and falls through correctly.
- Added gateway.status.parse_active_agents() and routed BOTH HTTP surfaces
  (/api/status and /health/detailed) through it, so the exposed active_agents
  field is consistently clamped non-negative. Previously /api/status clamped
  while /health/detailed exposed the raw file value, diverging on a corrupt
  count.
- Added TestParseActiveAgents covering the shared coercion contract.
2026-06-21 17:22:52 +05:30
Ben
0ee75469d7 feat(dashboard): surface gateway busy/drainable on /api/status
Give an external consumer (NAS) a trustworthy, always-reachable busy/idle
readout it can poll before a disruptive lifecycle action (restart,
migrate, stop, auto-update). The dashboard /api/status is the only HTTP
surface guaranteed up on a hosted agent regardless of which gateway
platforms are enabled, and it already reads gateway_state.json.

Add to /api/status (additive, non-breaking):
  - active_agents       — in-flight gateway-turn count (now refreshed
                          per-turn by the companion gateway-side commit)
  - gateway_busy        — running AND active_agents > 0
  - gateway_drainable   — running and live (a valid begin-drain target)
  - restart_drain_timeout — resolved seconds, so the consumer can size its
                          poll deadline without out-of-band knowledge
                          (env HERMES_RESTART_DRAIN_TIMEOUT → config
                          agent.restart_drain_timeout → default)

The busy/drainable contract is defined once in gateway.status
(derive_gateway_busy / derive_gateway_drainable) and consumed by both
/api/status and /health/detailed so the two surfaces can never disagree.
Liveness keys off gateway_running (a live PID/health probe), NEVER
gateway_updated_at — a healthy idle gateway never advances that timestamp.
All derived fields degrade to safe falsy values when the gateway is down
or the status file is absent/corrupt (never a spurious "busy" that would
wedge the consumer). active_sessions (the 5-min DB recency heuristic the
SPA reads) is left exactly as-is — new signal, new fields.

Tests (behaviour contracts, not snapshots): the pure derivation contract
across every running/state/count/liveness combination; /api/status
integration for busy, idle-drainable, draining, down, stale-busy-file,
corrupt-count, and timeout surfacing; and /health/detailed parity.
2026-06-21 17:22:52 +05:30
kshitijk4poor
1ca29723f0 fix(cli): log instead of swallow preflight-warning errors; consistent TUI warning field
Follow-up to the salvaged preflight-compression warning:
- Replace silent `except Exception: pass` at all 5 guard call sites
  (cli.py x2, gateway/slash_commands.py x2, tui_gateway/server.py) with
  `logger.debug(...)` so signature drift in the guard helper isn't hidden.
- tui_gateway/server.py: set the confirm dict's `warning` field to the
  merged message (was bare expensive-model text) so it matches
  `confirm_message` for any future consumer reading `warning`.
- Add trailing newlines to the two new files.
2026-06-21 16:31:56 +05:30
Tuna Dev
04730f32e7 fix(cli): warn when in-session model switch will preflight-compress
Adds hermes_cli/context_switch_guard.py mirroring the model_cost_guard
pattern. When a user switches models mid-session (Herm TUI picker, CLI,
or /model on Telegram/Discord), the warning surfaces on the existing
ModelSwitchResult.warning_message path used by the expensive-model
guard if the new model's compression threshold is below the current
session size.

Partial fix for #23767 — addresses only the 'user-facing guardrail
when switching from a high-context provider to a substantially
lower-context provider' slice. The other proposed fixes from that
issue (hard preflight token guard, metadata cache invalidation on
switch, compression safety invariant, oversized tool-output handling)
are out of scope for this PR.
2026-06-21 16:29:31 +05:30
kshitij
f6a504d088
Merge pull request #50025 from NousResearch/salvage/cron-run-immediate
fix(cron): execute job immediately on action=run
2026-06-21 13:53:13 +05:30
kyssta-exe
65d7c7fafd fix(cron): execute job immediately on action='run'
`cronjob(action='run')` (and `hermes cron run`) only set `next_run_at = now`
and returned success, relying on the scheduler ticker to actually execute the
job on its next tick. When no gateway/ticker is running — a CLI-only setup, or
the Windows case in #41037 — the job never executed: `run` reported success,
but `last_run_at` stayed null forever, no output, no delivery.

A manual `run` should actually run. `_execute_job_now` now:

- **claims the job via `claim_job_for_fire`** — the same at-most-once CAS the
  scheduler/external-provider fire path uses. This both advances `next_run_at`
  for recurring jobs and blocks a concurrently-running gateway ticker from
  double-firing the same job; if the claim is lost, the run is skipped (the
  tool reports `execution_skipped`). This closes the double-fire race that a
  bare `advance_next_run` left open (a tick whose `get_due_jobs` already
  captured the job between trigger and advance would still fire it).
- **delegates firing to `run_one_job`** — the single shared
  execute→save→deliver→mark body the ticker and external providers use — so
  failure delivery, `[SILENT]` handling, and live-adapter delivery stay
  identical across paths and can't drift. (The original salvage re-implemented
  this sequence inline and had already dropped failure delivery + `[SILENT]`.)

The tool response carries `executed`, `execution_success`, and either
`execution_error` or `execution_skipped`. The `hermes cron run` CLI message no
longer claims "It will run on the next scheduler tick" — it reports the actual
"Ran now: succeeded/failed" outcome (or the skip).

Salvaged from #41130 by @kyssta-exe (authorship preserved); reworked to reuse
`claim_job_for_fire` + `run_one_job` per review rather than re-implementing the
fire sequence inline. Adds tests for the claim-then-fire path, claim-lost skip,
failure reporting, and exception capture.

Fixes #41037

Co-authored-by: kyssta-exe <kyssta-exe@users.noreply.github.com>
2026-06-21 13:28:04 +05:30
annguyenNous
07424da76f fix(cron): keep ticker alive on BaseException + heartbeat-aware status
The in-process cron ticker (cron/scheduler_provider.py) caught only
`Exception` and logged at DEBUG, so a `SystemExit`/`KeyboardInterrupt`
raised from a misbehaving provider SDK or agent retry path killed the
ticker thread silently. The gateway PROCESS stayed up, so `hermes cron
status` — which only checks `find_gateway_pids()` — kept reporting
"✓ jobs will fire automatically" while no jobs ever fired (#32612,
#32895).

This makes ticker death survivable and detectable:

- The ticker loop now catches `BaseException` and logs at ERROR with a
  traceback, so a single bad tick no longer tears the thread down and
  the failure is visible in the gateway log.
- The loop records a heartbeat (`cron/ticker_heartbeat`, epoch seconds)
  on startup and after every tick — best-effort, never raised into the
  loop. Both ticker entry points (the gateway and the desktop fallback
  in web_server.py) funnel through `InProcessCronScheduler.start`, so one
  heartbeat site covers both.
- `hermes cron status` now reads the heartbeat age: if the gateway is
  running but the heartbeat is stale (> 200s, i.e. several missed ~60s
  ticks), it reports the ticker as STALLED and suggests a restart instead
  of falsely claiming jobs will fire. A missing heartbeat (older build /
  never ran) is treated as "unknown", not "dead".

Adds tests for BaseException survival, per-iteration heartbeat recording,
heartbeat round-trip/age, staleness detection, and silent-write-failure.

Salvaged from #49660 (BaseException survival on current structure),
extended with the heartbeat + honest-status reporting that the earlier
(pre-refactor) watchdog PRs #35616 and #33849 proposed.

Fixes #32612
Fixes #32895

Co-authored-by: banditburai <promptsiren@gmail.com>
Co-authored-by: sweetcornna <96944678+sweetcornna@users.noreply.github.com>
2026-06-21 13:00:50 +05:30
Tortugasaur
c02648c5dd fix(docs): align slash-command and docker docs 2026-06-20 23:23:47 -07:00
Kevin Anderson
b337afdf6e docs(cli): fix broken terminal-backend guide link in setup wizard
The terminal backend onboarding step pointed at
/docs/developer-guide/environments, which no longer exists. Point it at
the live docs page /docs/user-guide/configuration#terminal-backend-configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-20 23:23:47 -07:00
EloquentBrush0x
9bd5003d4f fix(spotify): quarantine dead tokens on terminal refresh failure
resolve_spotify_runtime_credentials() called _refresh_spotify_oauth_state()
without a try/except, so a terminal failure (HTTP 400/401, invalid_grant,
refresh_token_reused) raised AuthError but left the dead refresh_token in
auth.json. Every subsequent session re-read and retried the same token over
the network, failing identically each time.

Fix: wrap the refresh call and, when exc.relogin_required is True and a
refresh_token is present, clear the dead OAuth fields (access_token,
refresh_token, expires_at, expires_in, obtained_at) and write a
last_auth_error quarantine marker to auth.json before re-raising. The next
call sees no access_token and fails fast with spotify_access_token_missing —
no network retry — and the user is prompted to re-authenticate.

Mirrors the quarantine pattern already in place for Nous, xAI-OAuth,
Codex-OAuth (#28116, #28118), and MiniMax-OAuth (#28119).
2026-06-20 23:23:47 -07:00
teknium1
7ace96ba40 fix(compression): preserve goal, platform, and session indexing across rotation
Three state-loss bugs at the compression rotation boundary, fixed together
because they all live in the same ~80-line rotation block:

- #33618: a persistent /goal did not follow the rotation. load_goal does a
  flat per-session lookup with no lineage walk, so a goal silently died when
  compression minted a fresh child id. Added migrate_goal_to_session() and
  call it after the child session is created (move-not-copy: the parent row
  is archived as cleared so exactly one active goal row exists).

- #33906/#33907: if the child create_session raised (FK constraint,
  contended write), the outer handler only warned and let the agent continue
  on the NEW id — which has no row in state.db — producing an orphan session.
  Now the rotation rolls agent.session_id back to the still-indexed parent
  (reopening it) instead of stranding the conversation on a phantom id.

- #27633: the compaction-boundary on_session_start notification omitted the
  platform kwarg, so context-engine plugins saw source=unknown for every
  message after the boundary. Forward platform (matching the initial
  session-start call in agent_init.py).

Co-authored-by: denisqq <21260182+denisqq@users.noreply.github.com>
Co-authored-by: zccyman <16263913+zccyman@users.noreply.github.com>
Co-authored-by: liuhao1024 <sunsky.lau@gmail.com>
2026-06-20 20:06:24 -07:00
Teknium
170ef24c8f
fix(doctor): audit WhatsApp bridge at its resolved (HERMES_HOME) dir (#49890)
doctor's npm audit hardcoded PROJECT_ROOT/scripts/whatsapp-bridge. In
read-only Docker installs the bridge deps live in the writable HERMES_HOME
mirror (#49561), so node_modules was never found there and the bridge audit
silently skipped. Resolve the dir through the shared
resolve_whatsapp_bridge_dir() helper so doctor audits where deps actually
install. Falls back to the install-tree path if the helper is unavailable.
2026-06-20 19:55:12 -07:00
teknium1
15cfc2836f fix(kanban): anchor no-path worktree tasks on board default_workdir
Follow-up to the salvaged worktree-materialization fix. When a worktree
task has no explicit workspace_path, resolve the anchor from the board's
default_workdir (a git repo) and materialize <repo>/.worktrees/<id> per
task, instead of silently rooting under the dispatcher's CWD (whatever
directory launched the gateway, e.g. the Hermes checkout). If no
default_workdir is configured, raise with a clear message rather than
guessing from CWD.

Adds AUTHOR_MAP entry for the salvaged commit.
2026-06-20 19:12:23 -07:00
Ahmad Ashfaq
d79f67fda6 fix(kanban): materialize and reuse linked worktrees for worktree tasks
The dispatcher treated workspace_kind=worktree as metadata only and never
ran 'git worktree add', so every worktree task ran in the main repo checkout
instead of an isolated worktree — concurrent tasks silently shared one tree
and contaminated each other.

This materializes a real linked worktree at <repo>/.worktrees/<task_id> on
branch wt/<task_id> when resolve_workspace() handles a worktree task, treats a
repo-root workspace_path as shorthand for that location, persists the derived
workspace/branch back onto the task row, and — on rerun/redispatch — detects an
already-materialized linked worktree (via git-common-dir) and reuses it instead
of nesting a second .worktrees/<id> inside it.
2026-06-20 19:12:23 -07:00
Zheng Tao
491579fa05 fix(whatsapp): resolve bridge dir with HERMES_HOME mirror in Docker
In Docker the install tree (/opt/hermes) is read-only, so npm install for
the WhatsApp bridge fails with EACCES. Add resolve_whatsapp_bridge_dir() in
whatsapp_common.py: when the install dir is read-only, mirror the bridge
source into a writable HERMES_HOME location and use that. Both the
adapter and the 'hermes whatsapp' CLI resolve through the shared helper so
the install and runtime paths agree.

Fixes #49561
2026-06-20 17:05:27 -07:00
kshitijk4poor
69716a2e6f docs(compression): fix stale 'discarded' wording on in_place config flag
Review nit (yoniebans): the config.py comment still said compaction is
'lossy: the pre-compaction transcript is discarded, matching Claude Code /
Codex' — leftover from the original destructive design. The shipped behavior
is soft-archive: lossy for the LIVE context (what the model reloads), but the
pre-compaction turns are kept on disk (active=0, compacted=1), searchable via
session_search and recoverable. Comment now says so. Comment-only; no behavior
change.
2026-06-20 10:57:07 -07:00
kshitijk4poor
47fadc24d7 feat(compression): in-place compaction option that keeps one session id (#38763)
Context compression today rewrites the message list AND rotates the
session id — it ends the session, forks a parent_session_id child, and
renumbers the title (name -> name #2). That moving identity key is the
root cause of a whole bug cluster: /goal lost (#33618), pending response
lost at the split (#14238), orphan sessions (#33907), TUI sid desync
(#36777), FTS search gaps + duplicate sidebar entries (#45117), null
continuation cwd (#42228), and title-rename dead-ends (#48989). It also
forced a large defensive apparatus (compression lock, contextvar/env/
logging triple-sync, orphan finalization, gateway SessionEntry
re-propagation, tip projection) whose only job is surviving a
mid-conversation id change.

Add a compression.in_place config flag (default False during rollout).
When True, compaction rewrites the transcript and rebuilds the system
prompt but keeps the SAME session_id: no end_session, no child row, no
title renumber, no contextvar/logging re-sync, no memory/context-engine
session-switch. The conversation keeps one durable id for life, like
Claude Code / Codex. Compaction is lossy by design — the pre-compaction
transcript is summarized away, not archived.

The rotation path is unchanged when the flag is off (moved verbatim into
an else branch). Staged rollout: this PR ships the option behind a
default-off flag for live validation; a follow-up flips the default and
deletes the now-redundant rotation machinery, superseding the 14 open
band-aid PRs in this area.

- hermes_cli/config.py: add compression.in_place (default False), documented
- agent/agent_init.py: resolve the flag -> agent.compression_in_place
- agent/conversation_compression.py: branch compress_context() on the flag
- tests/run_agent/test_in_place_compaction.py: in-place invariants +
  rotation regression guard + config default

The pre-flush of current-turn messages (#47202) runs in BOTH modes, so no
boundary data loss. Prompt-cache invariant preserved: the system-prompt
rebuild is the same single sanctioned invalidation that already happens
during compaction — no NEW invalidation. Message alternation preserved.
2026-06-20 10:57:07 -07:00
teknium1
37a4dd4982 fix(auth): heal poisoned Nous inference URL on refresh instead of retaining it
A nous inference_base_url that fails the host allowlist (e.g. a stale
stg-inference-api.nousresearch.com persisted before the allowlist
existed) was only replaced 'if refreshed_url:' — so when the validator
rejected the URL it left the poisoned value in place. The 'falling back
to default' warning fired but never took effect: every subsequent call,
including the auxiliary compression call, kept hitting the dead staging
endpoint and 401'd.

Reset to DEFAULT_NOUS_INFERENCE_URL when validation returns None at both
refresh sites in resolve_nous_runtime_credentials, so a poisoned
auth.json self-heals on the next refresh. The proxy adapter already did
this correctly; this brings the two auth.py sites in line.
2026-06-20 10:53:45 -07:00
Teknium
11c6f4c7bc
feat(setup): Blank Slate setup mode — minimal agent, opt in to everything (#36733)
* feat(setup): Blank Slate setup mode — minimal agent, opt in to everything

Adds a third first-time setup option alongside Quick Setup and Full Setup.
Blank Slate forces ON only what an agent needs to run — provider & model,
the File Operations toolset, and the Terminal toolset — and turns
everything else OFF, then walks the user through opting each capability
back in.

What it does:
- platform_toolsets.cli = [file, terminal] (explicit, authoritative list)
- agent.disabled_toolsets = every other known toolset (web, browser,
  code_execution, vision, memory, delegation, cronjob, skills, image_gen,
  kanban, …). Applied last in the resolver, so it overrides the
  non-configurable platform-toolset recovery that would otherwise re-add
  toolsets like kanban — guaranteeing a true blank slate.
- Optional config features off: compression, memory + user-profile capture,
  checkpoints, smart model routing, auto session reset.
- Bundled skills default to NONE (reuses the .no-bundled-skills marker);
  offers to seed the full catalog.
- Walks through tools / plugins / MCP / messaging, all opt-in.

Proven end-to-end: with the Blank Slate config, model_tools.get_tool_definitions
emits exactly 6 schemas — patch, process, read_file, search_files, terminal,
write_file. Nothing else reaches the model.

Re-enable later via hermes tools / hermes skills opt-in --sync /
hermes setup agent.

Tests: tests/hermes_cli/test_setup_blank_slate.py (8 tests) pin the writers,
the resolver invariant ({file, terminal}), and the 6-schema end-to-end set.
Docs: getting-started/quickstart.md documents all three setup modes.

* feat(setup): Blank Slate fork — finish minimal, or walk through configs

After applying the minimal baseline (provider/model + file + terminal,
everything else off), Blank Slate now presents a choice instead of always
running the full walkthrough:

  1. Start with everything disabled — finish now with the minimal agent.
  2. Walk through all configurations — opt in to tools, skills, plugins, MCP,
     and messaging.

Provider/model and terminal are still configured first either way (the agent
can't run without them). The finish-now path records the bundled-skill opt-out
so future `hermes update` runs don't re-inject skills. The walkthrough body
moved to a separate _blank_slate_walkthrough() helper.

Tests: TestBlankSlateFork covers both branches (finish-now applies baseline +
skill opt-out and skips the walkthrough; walkthrough path invokes it). Docs
updated to describe the fork.
2026-06-20 10:45:55 -07:00
Teknium
5600105478 refactor(gateway): migrate slack/dingtalk/whatsapp/matrix/feishu/telegram/wecom/email/sms adapters to bundled plugins
Salvage of PR #41284 onto current main. Relocates the last 9 inline messaging
adapters (+ satellites: telegram_network, feishu_comment/_rules/meeting_invite,
wecom_crypto, wecom_callback) from gateway/platforms/ into self-contained
bundled plugins under plugins/platforms/<x>/, discovered via the platform
registry. Strips the per-platform core touchpoints from gateway/run.py,
gateway/config.py, hermes_cli/gateway.py, hermes_cli/setup.py, and
tools/send_message_tool.py.

Carries forward the migration fixes (explicit enabled:false honored,
get_connected_platforms forces discovery, plugin is_connected via
gateway.get_env_value, logs --component gateway matches plugins.platforms.*,
matrix hidden on Windows).

Additionally ports config keys main added since the PR base: the matrix
plugin's _apply_yaml_config now also covers allowed_users,
ignore_user_patterns, process_notices, and session_scope (the inline
gateway/config.py matrix block gained these in the 1340 commits the PR sat
open; they would otherwise have been silently dropped on deletion).
2026-06-20 10:26:45 -07:00
kshitijk4poor
a7dd98c860 fix(env): guard remaining malformed int/float env var casts with utils helpers
Widen the env_float() guard from #48735 across the whole bug class: a
non-numeric value (e.g. a stale .env "HERMES_API_TIMEOUT=abc" or a typo'd
port) raised an unhandled ValueError and crashed adapter/agent init.

Converts 22 genuinely-unguarded first-party int/float(os.getenv()) sites to
the canonical utils.env_int / utils.env_float helpers (the established house
pattern), instead of duplicating per-module helpers or inline try/except:

- gateway/config.py: WECOM_CALLBACK_PORT, BLUEBUBBLES_WEBHOOK_PORT
- gateway/platforms/email.py: EMAIL_IMAP/SMTP_PORT, EMAIL_POLL_INTERVAL
- gateway/platforms/feishu.py: dedup cache + text/media batch settings
- gateway/platforms/wecom.py, discord/adapter.py: text batch delays
- gateway/platforms/telegram.py: media batch delay, TELEGRAM_WEBHOOK_PORT
- gateway/platforms/whatsapp.py: WHATSAPP_NPM_INSTALL_TIMEOUT
- hermes_cli/auth.py: CODEX/XAI refresh timeouts
- agent/chat_completion_helpers.py: API/stream read/stale timeouts
- run_agent.py, agent/auxiliary_client.py: API + nous timeouts

Sites already guarded by try/except or local helpers are left untouched.
The HERMES_MAX_ITERATIONS sites are already guarded on main via
_current_max_iterations(), so they are not included.
2026-06-20 14:54:36 +05:30
helix4u
c253b07380 fix(model): clear stale endpoint credentials across switches 2026-06-19 19:58:26 -07:00
helix4u
95a3affc2e fix(model): keep Nous picker from restoring stale custom keys 2026-06-19 19:58:26 -07:00
Teknium
cf58f1a520
feat(titles): support language-aware title generation (#45296)
Make auxiliary title prompts match the user language by default, with an optional pinned `auxiliary.title_generation.language` config.
2026-06-19 17:15:52 -07:00
teknium1
64b21e50fb fix(cli): publish agent ref to cli module so memory on_session_end fires on exit
The god-file Phase 4 refactor (094aa85c37) moved agent construction into
CLIAgentSetupMixin, which set the atexit shutdown reference with a bare
`global _active_agent_ref`. After extraction that global binds the *mixin
module's* namespace, not cli.py's. cli._run_cleanup reads
cli._active_agent_ref to decide whether to fire the memory provider's
on_session_end hook — and it stayed None for the whole session, so the
`if _active_agent_ref:` branch was dead and on_session_end never ran on
/exit. Custom memory providers silently lost end-of-session extraction.

Fix: publish the reference onto the cli module explicitly
(`import cli as _cli; _cli._active_agent_ref = self.agent`), using the
deferred-import pattern already established in the mixin.

Regression test asserts cli._active_agent_ref is populated by the mixin's
publish line and guards against a relapse to the bare `global` form. The
existing shutdown tests passed only because they hand-assigned the ref,
which is exactly what masked this.
2026-06-19 16:59:43 -07:00
kshitijk4poor
d4e7dd609d refactor(windows): tidy managed-node resolver helpers
Behavior-preserving cleanups on the managed-node resolver:
- Hoist _candidate_node_command_names() out of the inner dir loop in
  find_hermes_node_executable (computed once, not per directory).
- Drop redundant os.environ.copy() at the two with_hermes_node_path(
  os.environ.copy()) sites \u2014 the helper already copies os.environ when
  called with no argument (verified env-equivalent).
- Add reciprocal keep-in-sync comments between iter_hermes_node_dirs()
  (hermes_constants.py) and hermesManagedNodePathEntries() (electron
  main.cjs), which mirror the same platform-ordering rule across the
  Python/Node boundary.
2026-06-20 02:12:16 +05:30
kshitijk4poor
fcc169057d fix(windows): prefer managed npm for hermes update desktop-rebuild gate
The `hermes update` desktop-rebuild gate still used a bare
`shutil.which("npm")` presence check. On a Windows box where the only
working npm is the Hermes-managed npm.cmd (not on PATH), the gate would
skip the desktop rebuild even though _build_web_ui / cmd_gui can now find
it via find_node_executable. Route the gate through the same resolver for
full bug-class coverage.

Surfaced during review of #49239.
2026-06-20 02:01:24 +05:30
helix4u
7a7b56d498 fix(windows): prefer managed node for whatsapp and desktop 2026-06-20 02:00:37 +05:30
teknium1
2bd1977d8f chore: release v0.17.0 (2026.6.19) 2026-06-19 12:38:31 -07:00
alt-glitch
88d523220f fix(mcp): address adversarial review round 2 (stale-publish race, parity holes)
Second review pass (Codex + Hermes subagent). Codex reproduced a real race with
a two-thread harness; both converged on the remaining issues.

- Generation-aware publish (fixes a lost-update race): two refresh callers (the
  late-refresh daemon and the between-turns prologue around turn 1) could each
  compute a snapshot outside the lock; a SLOWER caller holding an OLDER registry
  generation could acquire the publish lock after a newer caller and clobber it,
  deleting just-landed tools. refresh_agent_mcp_tools now captures
  registry._generation before computing and refuses to publish a stale set;
  agent._tool_snapshot_generation tracks the published generation.
- Context-engine routing names (_context_engine_tool_names) are now staged on a
  local and published atomically with the snapshot, and only claimed when this
  rebuild actually appended the schema — matching agent_init's dedup so a
  registry/plugin tool of the same name keeps its own dispatch. (Previously
  mutated live, before the publish lock, and on no-change refreshes.)
- CLI /reload-mcp: self.enabled_toolsets is resolved once at startup, so a
  server newly ENABLED in config mid-session wasn't picked up (TUI already
  re-resolved). Merge now-connected MCP server names into the override (unless
  the user pinned all/*), mirroring startup, and keep self.enabled_toolsets in
  sync. Closes the CLI/TUI parity hole.
- ACP (acp_adapter/server.py) routed through the shared helper — it was a 5th
  sibling rebuild that re-injected memory tools but NOT context-engine tools and
  bypassed the atomic/name-diff path (inert today, fragile).
- mcp_startup._resolve_discovery_timeout pulls its default from DEFAULT_CONFIG
  (single source of truth) instead of a stale hardcoded 5.0 literal.
- Tests: stale-generation-no-clobber, _skip_mcp_refresh honored, timeout
  fallback uses DEFAULT_CONFIG.
2026-06-19 11:57:43 -07:00
alt-glitch
b6e2a54a94 fix(mcp): address adversarial review round 1 (cache parity, gates, races)
Consolidated findings from three independent reviewers (Codex, Claude Code, a
Hermes subagent w/ the hermes-agent-dev skill):

- BLOCKING: refresh_agent_mcp_tools rebuilt only the registry subset, silently
  dropping post-build-injected memory-provider (mem0/honcho/…) and context-
  engine (lcm_*) tools on every refresh. Now additive-preserving: re-applies
  the same injectors agent_init uses, staged on locals and published atomically.
- Re-injection now honors the #5544 enabled_toolsets gate for context-engine
  tools, so a restricted-toolset platform can't get lcm_* leaked back in.
- Atomic read-diff-publish under one lock: the returned `added` set and the
  (tools, valid_tool_names) pair are consistent even under concurrent callers
  (no half-swap, no TOCTOU).
- background_review fork opts out (_skip_mcp_refresh) so its byte-identical
  tools[] cache parity with the parent is preserved.
- CLI /reload-mcp routed through the shared helper (was a 4th divergent copy
  with the same clobber bug + missing disabled_toolsets).
- Explicit reloads (TUI RPC + CLI) pass enabled_override so a server the user
  just enabled in config this session is picked up; automatic paths reuse the
  agent's build-time selection.
- mcp_discovery_timeout default 5.0 -> 1.5s: correctness now comes from the
  between-turns refresh, so the startup wait is only a small turn-1 UX bump
  rather than a heavy dead-server latency penalty.
- has_registered_mcp_tools checks registered TOOLS (not connected servers) so a
  zero-tool/prompt-only server doesn't make the per-turn hook fire forever.
- Tests: rewrote the thread-safety test to actually exercise the write path
  (alternating tool sets), added the #5544-gate regression, the memory/context
  preservation regression, and a "callable next turn via valid_tool_names"
  contract; removed a dead monkeypatch line.
2026-06-19 11:57:43 -07:00
alt-glitch
93d6e73028 fix(mcp): expose late-connecting MCP tools to the agent (TUI/CLI/gateway)
MCP servers that connect after the agent's one-time tool snapshot were
invisible for the whole session. Two root causes, fixed together:

1. The startup discovery wait was a flat 0.75s. HTTP/OAuth servers
   commonly take 2-6s on a cold connect, so they missed the window and
   their tools never entered the agent's snapshot. `thread.join(timeout)`
   already returns the instant discovery completes, so raising the bound
   costs ~0s for the common case (no MCP / fast servers) and only ever
   blocks for a genuinely-pending server, capped so a dead server can't
   freeze startup. The bound is now configurable via
   `mcp_discovery_timeout` (config.yaml, default 5.0s).

2. Three call sites duplicated the agent tool-snapshot rebuild (the TUI
   `reload.mcp` RPC, the gateway reload, and the TUI late-binding refresh
   thread), and the late-refresh detected changes by tool COUNT — missing
   an equal-size add/remove swap. Consolidated into one shared
   `tools.mcp_tool.refresh_agent_mcp_tools(agent)` helper that diffs by
   tool NAME, mutates the agent under a lock (thread-safe), and respects
   the agent's own enabled/disabled toolsets.

The late-binding refresh keeps its pre-first-turn cache-safety guard:
it never rebuilds the tool list once a turn has started, so the cached
prompt prefix is never invalidated mid-conversation.

Tests: new tests/tools/test_refresh_agent_mcp_tools.py covers the
name-based diff, in-place mutation, agent-scoped filtering, thread
safety, and the config-driven discovery bound (incl. instant-return
when nothing is pending). 75 passed across the touched areas.
2026-06-19 11:57:43 -07:00
Teknium
2a5e9d994a
Merge pull request #48275 from NousResearch/feat/cron-scheduler-provider-chronos
feat(cron): pluggable CronScheduler interface + Chronos managed-cron provider (scale-to-zero)
2026-06-19 07:51:59 -07:00
Ben
1928aa0443 fix(managed-scope): honor managed scope in config→env bridges too
Manual verification surfaced a second bypass class beyond the standalone
config loaders: several code paths bridge config.yaml values into os.environ
(HERMES_TIMEZONE, HERMES_REDACT_SECRETS, HERMES_MAX_ITERATIONS, TERMINAL_*,
network.force_ipv4, ...) by reading the raw user YAML, so the env the whole
process reads carried the USER's value even when an administrator pinned it —
e.g. a managed timezone was overridden because gateway/run.py wrote the user's
timezone into HERMES_TIMEZONE, and _resolve_timezone_name() checks the env var
first.

Wired the shared apply_managed_overlay() into every config→env bridge:

- gateway/run.py module-level startup bridge (timezone, redact_secrets,
  max_turns, terminal, display, gateway.strict, ...)
- gateway/run.py _reload_runtime_env_preserving_config_authority (the per-turn
  re-bridge that keeps config authoritative over reloaded .env — must keep
  MANAGED authoritative on every turn, not just startup)
- hermes_cli/main.py early security.redact_secrets / network.force_ipv4 bridge
  (runs before load_config is usable, at import time)
- hermes_cli/send_cmd.py top-level scalar config→env bridge

Verified end-to-end against a writable managed dir (12/12 checks incl. timezone,
logging, model, skin, gateway settings, write-guard) and in a clean process the
gateway per-turn bridge writes HERMES_TIMEZONE=<managed>. Adds an
order-independent regression test for the bridge overlay.
2026-06-19 07:46:33 -07:00
Ben
b0e47a98f9 fix(managed-scope): honor managed scope in all standalone config loaders
The skin bug was one instance of a class: several subsystems build their
config dict directly from config.yaml instead of routing through
hermes_cli.config.load_config (which carries the managed merge), so they
silently ignored administrator-pinned values. Audited every config.yaml
reader and fixed the behavioral-read bypasses:

- gateway/config.py load_gateway_config (messaging gateway: session_reset,
  quick_commands, stt, model, ...)
- gateway/run.py _load_gateway_config (its read_raw_config fast path also
  skipped the merge — read_raw_config returns raw user YAML)
- tui_gateway/server.py _load_cfg (new TUI + desktop backend: skin,
  reasoning_effort, service_tier, provider_routing)
- cron/scheduler.py (scheduled-job model/reasoning/toolsets/provider_routing)
- hermes_logging.py (logging.level/max_size_mb/backup_count)
- hermes_time.py (timezone)
- hermes_cli/doctor.py (memory-provider diagnostic reads effective config)

All route through a new shared managed_scope.apply_managed_overlay() helper
that mirrors _load_config_impl (env-only expansion so a user ${VAR} can't
shadow a managed literal, root-model-string normalization, leaf-merge) and is
fail-open. cli.py's earlier inline fix is refactored onto the same helper.

Write-back paths (slash_commands, telegram/yuanbao dm_topics, profile
distribution) are deliberately left reading raw user YAML — overlaying managed
values there would persist them into the user file. The dashboard
(web_server.py) already routes through load_config and needed no change.

TUI loader caches the RAW config so _save_cfg never writes managed values to
disk. Adds test_managed_scope_overlay.py (helper) and
test_managed_scope_loaders.py (per-surface integration); mutation-checked.
2026-06-19 07:46:33 -07:00
Ben
ddd519ea70 feat(managed-scope): surface managed scope in config show and doctor
- show_config prints an administrator header naming the managed source and
  lists the pinned config/env keys when a scope is active (silent otherwise).
- hermes doctor gains a managed_scope_check under Configuration Files that
  reports the resolved managed dir + pinned key counts, and flags a
  HERMES_MANAGED_DIR redirect (the documented foot-gun).
2026-06-19 07:46:33 -07:00
Ben
4f9e15df97 feat(managed-scope): guard writes to managed config/env keys
- set_config_value hard-rejects a managed config key (D2) and names the
  source, exiting non-zero.
- save_env_value / remove_env_value refuse a managed env key.
- save_config strips managed leaves from a bulk write (mechanical safety net)
  with a warning, so the unmanaged remainder still persists.
New _strip_dotted_keys helper drives the bulk-save pruning. All guards are
distinct from and layered after the existing is_managed() package-manager
write-lock.
2026-06-19 07:46:33 -07:00
Ben
81a663abea feat(managed-scope): apply managed .env last with override
load_hermes_dotenv now loads the managed-scope .env after user/project .env
and external secret sources, with override=True, so managed env values beat
the user .env and any pre-existing shell export. Reuses the existing dotenv
fallback + credential-sanitization path. Fail-open: no managed dir/.env is a
no-op and any error is swallowed so managed scope never blocks startup.
2026-06-19 07:46:33 -07:00
Ben
b5ddd6e719 feat(managed-scope): managed config layer wins over user config
_load_config_impl now deep-merges the managed config.yaml on top of the
expanded user config so managed leaves win while sibling keys stay
user-controlled (leaf-level merge, D3). Managed values are expanded against
the process env only, never user-defined ${VAR}, so a user can't shadow a
managed literal. The managed file's (mtime,size) is folded into the load
cache key so editing it invalidates the cache. This inverts the usual
env-over-config precedence for pinned keys by design (see design doc §4.1).
2026-06-19 07:46:33 -07:00
Ben
9cbcc0c9c8 feat(managed-scope): add managed_scope module (resolver, loaders, key helpers)
New hermes_cli/managed_scope.py resolves a system-level managed directory
(HERMES_MANAGED_DIR override > /etc/hermes), parses managed config.yaml/.env
with fail-open semantics, and exposes is_key_managed/is_env_managed helpers.
The system default is ignored under pytest and HERMES_MANAGED_DIR is added to
the conftest env scrub so a real managed scope can't leak into the suite.

Not wired into the load paths yet (Phases 2-3).
2026-06-19 07:46:33 -07:00
teknium1
a58287afcb
Merge remote-tracking branch 'origin/main' into pr48275-rebase
# Conflicts:
#	cron/scheduler.py
2026-06-19 07:40:29 -07:00
Teknium
35e7ca03d5 fix(kanban): treat already-gone worker as terminated, not survived
_terminate_reclaimed_worker early-returned on ProcessLookupError with
terminated=False. The new reclaim-defer guard reads that as 'worker
survived the kill' and defers the reclaim forever, so a stale task whose
worker is already dead never lands in result.stale. ProcessLookupError
means the process is gone — that IS a successful termination. Split it
from the generic OSError branch and set terminated=True.
2026-06-19 07:38:10 -07:00
Sahil Saghir
b9e521da23 fix(kanban): hold reclaim while the worker is still alive
release_stale_claims and detect_stale_running call _terminate_reclaimed_worker
and then release the task claim unconditionally, even when the termination did
not actually kill the worker. _terminate_reclaimed_worker already reports this
via its "terminated" flag, but the callers ignore it.

When a worker is parked in uninterruptible (D) state — for example throttled by
a cgroup memory.high limit — a pending SIGTERM/SIGKILL cannot be delivered until
the throttle lifts, so the kill is a no-op. The dispatcher then frees the claim
and spawns a fresh worker beside the still-alive one. Repeated every dispatch
tick this accumulates duplicate workers without bound, deepening the memory
pressure that caused the throttle in the first place — a self-reinforcing
runaway.

Fix: gate both automatic reclaim paths on _worker_survived_termination(). When
we attempted to kill our own host-local worker and it is still alive, defer the
reclaim (_defer_reclaim_for_live_worker extends the claim a short grace and
emits a reclaim_deferred event) instead of releasing. This guarantees at most
one live worker per task and is self-correcting: not spawning a duplicate is
what relieves the pressure so the pending signal lands and the worker dies, and
the next tick reclaims cleanly. Non-host-local claims and the operator-driven
reclaim_task() path keep their existing force-release behaviour.

Related: #41448 (concurrent dispatchers amplify this by doubling reclaim
frequency); #42858 (kill the worker rather than orphan it on archive).

Tests: defer-when-worker-survives, reclaim-when-killed,
release-when-not-host-local, and the detect_stale_running path.
2026-06-19 07:38:10 -07:00
Teknium
d7bff949af
fix(cli): default cli_refresh_interval to 1.0 to keep status bar alive (#49087)
PR #49056 set the default to 0, which reverts the #45592 idle-clock fix:
without a periodic invalidate, prompt_toolkit stops repainting the bottom
chrome during idle and the status bar goes stale/disappears after a turn.

Restore 1.0 as the default for everyone. The config knob stays — users on
emulators where the per-second redraw fights auto-scroll (#48309) can set
display.cli_refresh_interval: 0 to opt out.
2026-06-19 07:35:06 -07:00
Ben Barclay
1e70df5fdd feat(gateway): multiplex phase 4 — lifecycle guard + per-profile observability
- _guard_named_profile_under_multiplexer: when the default gateway is running
  with gateway.multiplex_profiles=on, a named-profile 'hermes gateway run' hard
  -errors (pointing at the multiplexer) instead of double-binding that
  profile's platforms. Inert unless all hold: this invocation is a named
  profile, a default-profile gateway is alive, and its config has multiplexing
  on. --force overrides. Wired into run_gateway's guard chain.
- write_runtime_status gains served_profiles: the secondary-adapter startup
  records [active] + multiplexed profiles into runtime_status.json so
  'hermes status' can show per-profile coverage without a second probe. Absent
  for single-profile gateways.

Tests: served_profiles round-trips and is absent by default; guard is inert for
the default profile / under --force / when no default gateway is running.
2026-06-19 07:34:15 -07:00
Ben Barclay
f538470cf4 feat(gateway): multiplex phase 2 — fail-closed profile credential isolation (Workstream A)
The credential gate. When multiplexing is active, a profile's secrets resolve
from a context-local scope, never the process-global os.environ (which in a
multiplexer may hold another profile's keys, and is inherited by every
subprocess spawned with env=dict(os.environ)).

- agent/secret_scope.py: get_secret() backed by a secret-scope contextvar.
  FAIL-CLOSED: when multiplex is active and no scope is installed, an unscoped
  read RAISES UnscopedSecretError instead of falling back to os.environ — a
  missed/new call site crashes loudly at that line rather than leaking a
  cross-profile value. Genuinely-global vars (HERMES_*, PATH, kanban paths,
  …) keep reading os.environ via an allowlist. load_env_file/build_profile_
  secret_scope parse a profile .env into an isolated dict WITHOUT mutating
  os.environ. Off by default => transparent os.getenv behavior.
- hermes_cli/runtime_provider.py: all credential/provider/base-url reads go
  through _getenv -> get_secret.
- agent/credential_pool.py: env fallbacks route through get_secret (the
  ~/.hermes/.env-first preference is preserved and already profile-correct via
  the home override).
- tools/mcp_tool.py: MCP config  interpolation resolves through
  get_secret, so a server's  picks up the routed profile's value.
- gateway/run.py: set_multiplex_active() at GatewayRunner init; per-turn .env
  reload is a no-op for credentials in multiplex mode (secrets come from the
  scope, not global env); _profile_runtime_scope context manager combines the
  HERMES_HOME override + secret scope; _run_agent wraps _run_agent_inner in
  that scope (resolved via _resolve_profile_home_for_source) when multiplexing.

Propagates into the agent worker thread for free via the existing
copy_context() in _run_in_executor_with_context.

Tests: 13 unit (fail-closed, scope isolation, global allowlist, .env parsing
without environ mutation) + 7 E2E (runtime_provider + MCP interpolation prove
two profiles isolated, unscoped read raises, globals still read environ).
2026-06-19 07:34:15 -07:00
Ben Barclay
d82f9fa7f7 feat(gateway): multiplex phase 0 — config flag, profile enumeration, profile-stamped session keys
Foundations for serving multiple profiles from one gateway process, inert
when off:

- gateway.multiplex_profiles config flag (default false), round-trips through
  GatewayConfig and load_gateway_config (top-level + nested gateway.* form).
- hermes_cli.profiles.profiles_to_serve(multiplex): the single chokepoint for
  which (profile, HERMES_HOME) pairs the gateway serves. Lightweight dir scan;
  active-profile-only when off, default + all named profiles when on.
- build_session_key gains a profile= namespace slot. Default/None reuse the
  historical 'agent:main:...' literal BYTE-IDENTICALLY (no session migration,
  positional parsers unaffected); a named profile becomes 'agent:<profile>:...'
  so two profiles on the same platform/chat never collide.
- SessionStore._resolve_profile_for_key + _session_key_for_source fallback
  resolve the namespace from the flag (legacy when off, active profile when on).

Tests: byte-identical-when-off (parametrized), namespace isolation, positional
layout preserved, config round-trip, profiles_to_serve enumeration.
2026-06-19 07:34:15 -07:00
teknium1
1d59d2dcae feat(desktop): resolve OAuth status for catalog-only account providers
Accounts-tab cards derived from the unified provider_catalog() carry
status_fn=None and had no hardcoded branch in _resolve_provider_status,
so any future OAuth/account provider plugin rendered permanently
logged-out. Fall through to the canonical hermes_cli.auth.get_auth_status
slug dispatcher and adapt its shape, so membership AND status both
auto-extend with the hermes model universe.
2026-06-19 07:26:46 -07:00
Austin Pickett
8fe7b52ebf test(desktop): lock GUI⊇hermes model provider parity; surface Bedrock
Adds the end-to-end parity contract test: every CANONICAL_PROVIDERS entry (the
`hermes model` universe) must be configurable on a desktop Providers tab —
keys(/api/env) ∪ ids(/api/providers/oauth) ⊇ canonical. Asserted as an
invariant against the live endpoints so the GUI can never silently drift from
the CLI again.

Surfacing this contract caught Bedrock: it's aws_sdk (no api-key vars), so it
had no Keys card. /api/env now tags AWS_REGION/AWS_PROFILE to the bedrock
provider card. Anthropic is whitelisted as a legitimate dual-tab provider
(direct API key + subscription OAuth).

Also refreshes the _OAUTH_PROVIDER_CATALOG docstring to describe its new role
as the override base for _build_oauth_catalog().
2026-06-19 07:26:46 -07:00
Austin Pickett
60dfa0f31b feat(desktop): Accounts tab derives membership from unified provider catalog
/api/providers/oauth now unions the explicit hand-tuned OAuth cards
(_OAUTH_PROVIDER_CATALOG — bespoke flow/status/cli, plus the api-key Anthropic
PKCE card and synthetic claude-code row) with every accounts-tab provider in
provider_catalog(). Any OAuth/external provider in the `hermes model` universe
now appears automatically, closing the drift where google-gemini-cli and
copilot-acp had no Accounts card despite being CLI-configurable.

Adds read-only status cards for google-gemini-cli (via existing
get_gemini_oauth_auth_status) and copilot-acp (managed-by-CLI, like claude-code).
DELETE handler routes through the same _build_oauth_catalog() builder.

Parity test asserts the Accounts tab offers every accounts-tab catalog provider
as an invariant.
2026-06-19 07:26:46 -07:00