Commit graph

1182 commits

Author SHA1 Message Date
Brooklyn Nicholson
960ea8a849 fix(dashboard): honor injected HERMES_DASHBOARD_SESSION_TOKEN
The desktop shell mints a session token and signs its /api + /api/ws
calls with it via HERMES_DASHBOARD_SESSION_TOKEN, but the main-merge
restored a web_server.py that ignored the env var and minted its own
random _SESSION_TOKEN -- so every desktop request 401'd and the UI
reported "gateway offline". Read the injected token (fall back to a
fresh random one) so loopback HTTP + WS auth line up.

Adds a regression test so a future merge can't silently drop the read.
2026-05-29 23:26:31 -05:00
Brooklyn Nicholson
da6646a23b fix(merge): restore contracts caught by main-target CI 2026-05-29 21:46:11 -05:00
Brooklyn Nicholson
b86043834f Merge origin/main into bb/gui
Adopt main's web/ dashboard layout (apps/dashboard removed; web/ restored),
keep bb/gui's desktop CLI/update workspace handling, and preserve main's
mTLS/URL validation MCP changes. Dashboard backend is aligned to main with
only the intended STT provider quarantine/ElevenLabs override reapplied.
2026-05-29 20:40:08 -05:00
Bartok9
54aa4db1de fix(cli): remove Hermes-managed node/npm/npx symlinks on uninstall
The POSIX installer drops node/npm/npx symlinks in ~/.local/bin pointing
into $HERMES_HOME/node and prepends ~/.local/bin to PATH, shadowing an
existing nvm. Uninstall removed the hermes wrapper but left these behind,
so the user's default node/npm/npx stayed redirected after uninstall.

Add remove_node_symlinks() and call it from run_uninstall. It removes
~/.local/bin/{node,npm,npx} only when each is a symlink resolving into the
current Hermes home's node dir, so a link the user repointed at nvm or a
real binary is never touched. Handles dangling links too.

Closes #34536
2026-05-29 17:24:38 -07:00
Teknium
689ef5e233
feat(cli): warn on unsupported pip installs + fix stale update-check cache (#34491) (#34846)
* docs(code-execution): document HERMES_* env narrowing + passthrough workaround

The execute_code sandbox-child env scrub (108397726, #27303) deliberately
dropped the broad HERMES_ prefix passthrough, keeping only an operational
4-var allowlist (HERMES_HOME/PROFILE/CONFIG/ENV). A script that relied on a
non-secret HERMES_* var (HERMES_BASE_URL, HERMES_KANBAN_DB, HERMES_*_WEBHOOK,
or a plugin-defined one) now sees it unset in the child.

Document the behavior change and the two recovery routes (terminal.env_passthrough
in config.yaml, or required_environment_variables in skill frontmatter), plus
the debug log line that surfaces the drop for diagnosis.

* feat(cli): warn on unsupported pip installs + fix stale update-check cache after pip upgrade

Banner now shows a yellow warning when detect_install_method() == 'pip':
'pip install hermes-agent' isn't the supported install path (it exists on
PyPI for internal/CI reasons), so updates and issue support don't behave
correctly. Reuses existing install-method detection; warn, never block.

Also fixes #34491: check_for_updates() keyed its 6h cache only on ts+rev.
On the pip path (no HERMES_REVISION), rev is always None, so a
'pip install --upgrade' changed VERSION but left the cache valid — the
stale 'N commits behind' count survived the upgrade. Cache now also keys
on the installed VERSION and invalidates on mismatch.
2026-05-29 13:30:28 -07:00
Bartok9
3845d86b93 fix(cron): restore jobs.json emptied by config migration on update
Config-version migrations have been observed to leave cron/jobs.json
valid-but-empty after `hermes update`, silently dropping every scheduled
job (#34600). The existing malformed-shape guards in cron/jobs.py don't
catch this because {"jobs": []} is valid JSON.

Add restore_cron_jobs_if_emptied() as a post-migration safety net: if the
live cron/jobs.json now has zero jobs while the pre-update snapshot held
one or more, restore the snapshot copy in place and warn loudly. The
check is conservative — it only restores on unambiguous evidence of loss
(snapshot had jobs, live file readable-and-empty), so a user who genuinely
cleared their jobs is never second-guessed and an unreadable live file is
left untouched so real corruption still surfaces.

Wired into _cmd_update_impl after migrate_config(), reusing the existing
pre-update quick snapshot (which already captures cron/jobs.json).

Closes #34600
2026-05-29 13:22:54 -07:00
teknium1
38c4f8c371 test(gateway): update system-unit cwd assertion to HERMES_HOME anchor
test_system_unit_has_no_root_paths asserted the system unit's
WorkingDirectory was the remapped *checkout* path
(/home/alice/.hermes/hermes-agent). That is the brittle pin this PR
fixes — the system unit now anchors cwd at the target user's HERMES_HOME
(/home/alice/.hermes). The test's intent (no root-home leak, target-user
paths present) is unchanged and still holds.
2026-05-29 12:36:59 -07:00
teknium1
a1cb5fa2c7 fix(gateway): anchor service WorkingDirectory at HERMES_HOME, not the source checkout
The systemd unit (and launchd plist) pinned WorkingDirectory to PROJECT_ROOT
(the checkout the unit was generated from). When that checkout is transient —
a git worktree, or a clone hermes update later relocates/removes — the path
rots. systemd then fails the start at the CHDIR step (status=200/CHDIR) BEFORE
Python loads, so the on-boot refresh_systemd_unit_if_needed() self-heal never
runs and Restart=always crash-loops forever on a dead directory. Observed in
the wild: a gateway that crash-looped 153 times overnight, bot offline until a
manual 'hermes gateway restart' regenerated the unit.

Anchor cwd at HERMES_HOME instead — it never moves, always exists, and the
gateway never needed cwd to be the checkout (ExecStart uses an absolute python
+ -m hermes_cli.main). Existing broken units now differ from the generated unit
and self-heal on the next start/restart/update.
2026-05-29 12:36:59 -07:00
teknium1
8836b3a113 fix(cli): widen Windows .bat wrapper fix to custom-name alias path
The profile alias --name path in main.py rewrote the wrapper with a
hardcoded #!/bin/sh script right after create_wrapper_script(), clobbering
the .bat on Windows and reintroducing the exact bug for custom aliases.

create_wrapper_script() now takes an optional target so the alias file is
named after the alias while the -p content references the profile — one
platform-aware code path, no post-hoc rewrite.
2026-05-29 12:32:47 -07:00
liuhao1024
6312dd8c3a fix(cli): create .bat wrapper on Windows instead of POSIX shell script
On Windows, hermes profile create produced a #!/bin/sh script that the
shell cannot execute.  Now creates a .bat file with @echo off + %* on
Windows, and keeps the POSIX shell script on macOS/Linux.

Also fixes check_alias_collision to use 'where' instead of 'which' on
Windows, and remove_wrapper_script to find .bat files.

Fixes #34708
2026-05-29 12:32:47 -07:00
zapabob
aa283d1e4f fix(model): isolate custom provider picker credentials 2026-05-29 12:32:35 -07:00
Teknium
27a2c4f36f
fix(mcp): stop reporting false OAuth success when no token was obtained (#34807)
* docs(code-execution): document HERMES_* env narrowing + passthrough workaround

The execute_code sandbox-child env scrub (108397726, #27303) deliberately
dropped the broad HERMES_ prefix passthrough, keeping only an operational
4-var allowlist (HERMES_HOME/PROFILE/CONFIG/ENV). A script that relied on a
non-secret HERMES_* var (HERMES_BASE_URL, HERMES_KANBAN_DB, HERMES_*_WEBHOOK,
or a plugin-defined one) now sees it unset in the child.

Document the behavior change and the two recovery routes (terminal.env_passthrough
in config.yaml, or required_environment_variables in skill frontmatter), plus
the debug log line that surfaces the drop for diagnosis.

* fix(mcp): stop reporting false OAuth success when no token was obtained

`hermes mcp login` reported "Authenticated — N tool(s) available" for
servers that serve tools/list without auth (e.g. Google's official Drive
MCP server) even when the OAuth flow never completed — dynamic client
registration 400'd because the provider doesn't support RFC 7591, so no
token was ever acquired. Every real tool call then hung until timeout
with no indication of why.

Login now verifies a token actually landed on disk after the probe. When
it didn't, it warns that authentication didn't complete and shows the
config needed to supply a pre-registered client_id/client_secret (the
existing, already-supported workaround for DCR-less providers).

Adds a docs pitfall for Google Drive / Atlassian-style providers.

Fixes #34775
2026-05-29 12:32:19 -07:00
alt-glitch
a4c18f65d4 feat(video_gen): wire Nous subscription override into hermes tools UX
Add the same managed-gateway UX that image_gen already has:

- TOOL_CATEGORIES['video_gen'] gets a 'Nous Subscription' provider row
  with managed_nous_feature='video_gen' + video_gen_plugin_name='fal'
- NousSubscriptionFeatures gains a video_gen property + feature state
  computation (managed/active/available using the fal-queue gateway)
- _GATEWAY_TOOL_LABELS, _GATEWAY_DIRECT_LABELS, _ALL_GATEWAY_KEYS,
  _get_gateway_direct_credentials, opted_in all include video_gen
- apply_nous_managed_defaults and apply_gateway_defaults handle video_gen
- _is_toolset_satisfied checks Nous features for video_gen
- _is_provider_active detects managed video_gen (use_gateway + fal provider)
- _select_plugin_video_gen_provider accepts use_gateway kwarg, propagated
  from all 4 call sites in _configure_provider when managed_feature is set
- hermes setup status shows 'Video Generation (FAL via Nous subscription)'

Users on a Nous subscription can now pick 'Nous Subscription' under
hermes tools → Video Generation, which sets video_gen.provider=fal +
video_gen.use_gateway=true. The FAL plugin's _resolve_managed_fal_video_gateway
then routes through the managed queue gateway — no FAL_KEY needed.
2026-05-29 22:26:24 +05:30
teknium1
1c53d39eaa test: deflake process-registry kill + PTY resize tests
Two CI flakes surfaced on PR #34572 (both in files this PR doesn't touch;
pre-existing host-dependent flakes):

1. test_process_registry::TestPopenLeakOnSetupFailure — the failure-cleanup
   tests use a fake proc.pid (8888/9999) and assert proc.kill() runs. But
   spawn_local's primary cleanup is os.killpg(os.getpgid(pid), SIGKILL),
   falling back to proc.kill() only on ProcessLookupError/PermissionError/
   OSError. When the fake PID happens to exist on a busy host, os.getpgid
   succeeds, os.killpg fires against an UNRELATED real process group, and
   proc.kill() is never reached -> flaky AssertionError (and a real risk of
   SIGKILLing an innocent process group from a unit test). Patch os.getpgid
   to raise ProcessLookupError so the fallback path runs deterministically
   and no real killpg is ever issued.

2. test_web_server::test_resize_escape_is_forwarded — the receive loop calls
   the blocking conn.receive_bytes() with no exception guard. Once the child
   prints its winsize and exits, the PTY closes; on a missed-marker run the
   next recv blocks until the 30s pytest-timeout instead of failing fast.
   Add a try/except break (matching the working sibling tests) and bump the
   child's pre-read sleep 0.15s -> 0.5s so the resize reliably lands first.

Verified: 4/4 pass across 3 consecutive runs; root cause for #1 reproduced
(os.getpgid(1) succeeds -> old code skips proc.kill).
2026-05-29 04:22:41 -07:00
kshitijk4poor
a22c250001 refactor(auth): remove vestigial Nous min_key_ttl/inference_auth_mode params
After the legacy session-key path was removed, two parameters became dead
surface on the Nous runtime-resolution chain:

- min_key_ttl_seconds: del'd inside refresh_nous_oauth_pure and pass-through /
  telemetry-only in refresh_nous_oauth_from_state, _try_import_shared_nous_state,
  _nous_device_code_login, and resolve_nous_runtime_credentials. It controlled the
  now-deleted agent-key mint TTL and drives no behavior.
- inference_auth_mode: with the legacy mode gone, AUTO and FRESH are behaviorally
  identical; the value only fed _normalize_nous_inference_auth_mode validation and
  oauth trace output, never a branch.

Removing inference_auth_mode orphaned its whole supporting cluster
(NOUS_INFERENCE_AUTH_MODE_AUTO/FRESH, NOUS_INFERENCE_AUTH_MODES,
_normalize_nous_inference_auth_mode), and dropping min_key_ttl_seconds orphaned
DEFAULT_AGENT_KEY_MIN_TTL_SECONDS — all deleted here.

Updated every caller (run_agent, auxiliary_client, credential_pool, proxy adapter,
runtime_provider, web_server, main, auth_commands, setup) and pruned the matching
test kwargs. Deleted two tests that exercised the removed surface
(test_legacy_auth_mode_is_rejected, test_try_refresh_..._accepts_explicit_auth_mode).

No behavior change: net -134 LOC of dead code.
2026-05-29 02:24:48 -07:00
Robin Fernandes
4e4984a11a test(auth): update nous jwt-only expectations 2026-05-29 02:24:48 -07:00
Robin Fernandes
7e958dafc2 fix(auth): address Nous JWT fallback review 2026-05-29 02:24:48 -07:00
Robin Fernandes
41ff6e5937 refactor(auth): Disable Nous legacy session key fallback 2026-05-29 02:24:48 -07:00
Teknium
c01a2df0a3
fix(auth): don't launch a text-mode browser inside the terminal for OAuth (#34479)
OAuth auto-open only checked _is_remote_session() (SSH + cloud-shell env
vars). On a headless/CLI-only Linux box with no GUI browser, none of those
trip, so webbrowser.open() resolved to a console browser (w3m/lynx/links)
and launched it INSIDE the terminal — hijacking the user's TTY with the
xAI 'Account Management' login page instead of letting them copy the URL.

Add _can_open_graphical_browser(): returns False when webbrowser would
resolve to a known console browser, when $BROWSER names one, when there's
no display server on Linux, or when no browser resolves at all. Gate all 5
OAuth auto-open callsites (xAI loopback, Spotify loopback, MiniMax device
code, Anthropic, Google) on it in addition to the existing remote check.
Headless boxes now print the URL / fall through to manual-paste instead.
2026-05-29 01:23:06 -07:00
wysie
f32b66c758 fix: improve plugins list usability 2026-05-29 00:59:42 -07:00
teknium1
bc736ff543 test(model-catalog): use exact URL equality in fallback tests
CodeQL flagged 'hermes-agent.nousresearch.com' in url and similar substring
checks as py/incomplete-url-substring-sanitization. The rule is about URL
allowlist checks in production code, not test routing — there's no
security boundary here. Switch to url == self.PRIMARY / self.FALLBACK,
which is the same semantic and silences the rule.
2026-05-29 00:25:36 -07:00
teknium1
f2d88c820c fix(model-catalog): fall through to raw.github when Vercel 403s; swap step-3.5-flash for step-3.7-flash on OpenRouter+Nous
The docs site (Vercel) serves /docs/api/model-catalog.json behind a bot
mitigation rule that returns HTTP 403 + x-vercel-mitigated: challenge for
non-browser User-Agents — including urllib (what the CLI uses) and curl.
When that happens, get_catalog() falls back to the stale disk cache and
new model releases (Opus 4.8, etc.) never reach the /model picker even
though they're already in OPENROUTER_MODELS and the live OpenRouter API.

Adds a fallback URL chain: when the primary catalog URL fails, walk
DEFAULT_CATALOG_FALLBACK_URLS — currently the raw.githubusercontent.com
copy of the same file. GitHub raw doesn't bot-gate, so the manifest stays
reachable through Vercel firewall hiccups. Per-provider override URLs
keep their direct-fetch semantics (operators configure those specifically,
no implicit fallback).

Also swaps stepfun/step-3.5-flash for stepfun/step-3.7-flash in the
OpenRouter + Nous Portal curated picker lists. Native stepfun provider
configuration (api.stepfun.ai) is left alone — that depends on what
stepfun.ai itself serves, not what OpenRouter routes.

Test plan: 5 new TestFallbackChain tests cover primary-success,
primary-failure-fallback-success, all-fail, primary==fallback-dedup, and
end-to-end get_catalog routing through the new helper. Existing 23 tests
in test_model_catalog.py still pass (28 total). Wider tests/hermes_cli/
sweep: 5701/5701 pass.
2026-05-29 00:25:36 -07:00
kshitijk4poor
66827f8947 chore: prune unused imports and duplicate import redefinitions
Remove unused imports (F401) and duplicate/shadowed import
redefinitions (F811) across the codebase using ruff's safe
autofixes. No behavioral changes -- imports only.

- ~1400 safe autofixes applied across 644 files (net -1072 lines)
- __init__.py re-exports preserved (excluded from F401 removal so
  public re-export surfaces stay intact)
- Re-exports that are imported or monkeypatched by tests but look
  unused in their defining module are kept with explicit # noqa:
  F401 (gateway/run.py load_dotenv; run_agent re-exports from
  agent.message_sanitization, agent.context_compressor,
  agent.retry_utils, agent.prompt_builder, agent.process_bootstrap,
  agent.codex_responses_adapter)
- Unsafe F841 (unused-variable) fixes deliberately skipped -- those
  can change behavior when the RHS has side effects
- ruff lints remain disabled in pyproject.toml (only PLW1514 is
  selected); this is a one-time cleanup, not a config change

Verification:
- python -m compileall: clean
- pytest --collect-only: all 27161 tests collect (zero import errors)
- core entry points import clean (run_agent, model_tools, cli,
  toolsets, hermes_state, batch_runner, gateway)
- static scan: every name any test imports directly from an edited
  module still resolves
2026-05-28 22:26:25 -07:00
Teknium
69b74c15a3
fix(kanban): CLI dispatch honors max_in_progress/max_spawn from config; swap missing 'avoid-ai-writing' skill for bundled humanizer (#33488, #29415) (#34337)
Two small bugs in the kanban dispatcher's CLI surface that were
silently degrading two distinct workflows. Bundled because the test
files and the surrounding code surface overlap.

## #33488: hermes kanban dispatch ignored kanban.max_in_progress / max_spawn

The CLI wrapper in hermes_cli/kanban.py:_cmd_dispatch only passed
default_assignee and max_in_progress_per_profile through to
dispatch_once. The global concurrency cap (kanban.max_in_progress)
and the per-tick spawn limit (kanban.max_spawn) were silently dropped,
so operators using 'hermes kanban dispatch' as a one-shot or in a
custom loop couldn't reach either cap from config — only the gateway
embedded dispatcher honored them.

Fix: read both keys from config in the same coerce-positive-int
helper that already handled max_in_progress_per_profile. CLI --max
still wins over config kanban.max_spawn when both are present
(explicit operator signal beats default), but absent --max falls
back to config.

## #29415: synthesizer crashed in retry loop on missing skill

hermes_cli/kanban_swarm.py:212 hardcoded skills=['avoid-ai-writing'],
a skill that doesn't exist in the bundled skills/ directory or any
registered hub source. Every synthesizer worker spawn failed at CLI
startup with 'Unknown skill(s): avoid-ai-writing' before the agent
loop even started — the dispatcher retried up to failure_limit
(default 2), then auto-blocked the task, then dependency rules could
re-promote it, looping forever until manual intervention.

Fix: replace with 'humanizer' which is bundled at
skills/creative/humanizer/SKILL.md (description: 'Humanize text:
strip AI-isms and add real voice'). That's the obvious intent behind
the 'avoid-ai-writing' name, and the skill is platform-portable
(linux/macos/windows) so it works on every supported runtime.

## Tests

tests/hermes_cli/test_kanban_cli_dispatch_passthrough.py — 4 cases:
- CLI passes max_in_progress / max_spawn / default_assignee /
  max_in_progress_per_profile from config to dispatch_once
- CLI --max flag overrides config kanban.max_spawn
- Invalid cap values (0, -1, 'abc', '1.5') silently fall through to None
- kanban_swarm.py no longer references 'avoid-ai-writing' AND the
  replacement 'humanizer' skill exists at the expected on-disk path

Kanban suite: 468/468 pass (was 464; +4 new regression tests).
2026-05-28 21:00:46 -07:00
Ben
a618789dba fix(dashboard-auth): share /api/* public allowlist between legacy and OAuth gates
Two parallel public-path allowlists drifted: _PUBLIC_API_PATHS in
hermes_cli/web_server.py (legacy _SESSION_TOKEN middleware) and
_GATE_PUBLIC_PREFIXES in hermes_cli/dashboard_auth/middleware.py
(OAuth gate). The legacy list included /api/status (documented as a
non-sensitive read-only liveness target); the OAuth gate's list did not.

Effect: every wildcard-subdomain agent surfaced as STARTING/down to the
portal even though the dashboard was serving correctly. Nous account
service (src/server/agents/fly-provider.ts
getInstanceRuntimeStatus) fetches ``/api/status`` without a cookie
as its sole liveness probe; the OAuth gate's 401 looked identical to
'agent dead' on the portal side.

Fix: lift the allowlist into hermes_cli/dashboard_auth/public_paths.py
and have both middlewares import it. _path_is_public now consults
the shared frozenset first, then falls back to the gate's
auth-bootstrap/static prefix list. Future additions to the public list
hit both gates automatically.

Endpoint inventory (verified safe to remain public):

* /api/status            — version, gateway state, active session count,
                           auth-gate shape. Portal liveness probe target.
* /api/config/defaults   — config-defaults feed for the SPA's Config page
* /api/config/schema     — config schema for the SPA's Config page
* /api/model/info        — model catalogue metadata (context windows)
* /api/dashboard/themes  — theme manifests for the skin engine
* /api/dashboard/plugins — plugin manifests for the dashboard

No user data, no session content, no secrets. Same shape an external
monitoring agent would hit on /healthz.

Tests:

* New: test_gated_status_is_public (regression guard with the NAS
  fly-provider.ts liveness-probe rationale spelled out in the docstring)
* New: test_other_public_api_paths_are_public_under_gate (parametrised
  over the rest of PUBLIC_API_PATHS — proves 401 / 302-to-login is
  never the response)
* New: docker integration check #3 in
  test_dashboard_oauth_gate_engaged_by_default — /api/status
  remains 200 under the gate AND reports auth_required=True so the
  portal can distinguish modes
* Updated: test_full_login_round_trip_unlocks_gated_api now probes
  /api/sessions instead of /api/status (status is public, so it
  can no longer distinguish 'logged in' from 'gate accidentally
  disabled')
* Updated: TestApi401Envelope (the no-cookie / invalid-cookie /
  dead-cookie tests) probes /api/sessions for the same reason
* Updated: docker integration check #2 in
  test_dashboard_oauth_gate_engaged_by_default probes
  /api/sessions to prove the gate is intercepting
* Removed: dead _login() helper in
  test_dashboard_auth_status_endpoint.py (no longer needed since
  /api/status is reachable cold)

Companion to docs/handover/hermes-agent-dashboard-s6-insecure-fix.md
(the --insecure flag fix that shipped earlier).
2026-05-29 12:17:12 +10:00
Teknium
3b6347af15
feat(kanban): default_assignee fallback + per-profile concurrency cap (#27145, #21582) (#34244)
Two related dispatcher behaviors that have been missing for a while.

## kanban.default_assignee (#27145)

Reporter (@agarzon): dashboard creates a task without an assignee, task
parks in 'ready' forever even though the operator's intent ('default')
is perfectly clear. The dispatcher already had a 'skipped_unassigned'
bucket but no fallback routing — users had to manually type 'default'
in the assignee field every time.

Behavior: when 'kanban.default_assignee' is set in config.yaml, the
dispatcher applies that assignee to any unassigned ready task before
deciding whether to spawn. The row is mutated (assignee column + an
'assigned' event with source='kanban.default_assignee' for the audit
trail). Empty/whitespace config value = no fallback, preserving the
existing skipped_unassigned behavior.

Dry-run mode reports what WOULD happen via the new
'auto_assigned_default' bucket on DispatchResult, but does NOT mutate
the DB — operators using 'hermes kanban dispatch --dry-run' see the
routing decision before committing.

## kanban.max_in_progress_per_profile (#21582)

Reporter (@edwardchenchen, @simlu, 4 reactions): fan-out workloads
saturate one profile's local model / API quota / browser pool while
other profiles sit idle. The existing global 'max_in_progress' caps
total workers but doesn't balance across profiles.

Behavior: when 'kanban.max_in_progress_per_profile' is set to a
positive int, the dispatcher tracks per-assignee running counts (one
query at tick start) and refuses to spawn for any assignee already at
the cap. Tasks blocked this way go to a new
'skipped_per_profile_capped' bucket on DispatchResult as
(task_id, assignee, current_running_count) tuples — NOT an
operator-actionable failure, just 'try again next tick when the
profile has capacity'.

Pre-existing 'running' tasks count against the cap (verified via
regression test). The cap respects dry_run mode by incrementing
its in-memory counter on each would-be spawn so dry_run reports
the same balanced subset that a real tick would.

Invalid cap values (0, negative, non-int, None) are treated as 'no
cap', preserving the existing behavior. Backward-compatible for
installs that don't set the config.

## Surfaces

- 'hermes kanban dispatch' CLI now prints 'Auto-assigned to
  kanban.default_assignee=X: ...' and 'Deferred (X at per-profile cap,
  N running): ...' lines, plus matching JSON keys in --json output.
- Gateway dispatcher logs the configured values at startup
  ('default_assignee=X', 'max_in_progress_per_profile=N').
- 'kanban.max_in_progress_per_profile' added to DEFAULT_CONFIG with
  inline docs.

## Validation

- tests/hermes_cli/test_kanban_default_assignee.py (6 cases): no-cap
  baseline, auto-assign + DB mutation, dry-run reports without
  mutating, whitespace treated as None, explicit assignees untouched,
  DispatchResult field schema.
- tests/hermes_cli/test_kanban_per_profile_cap.py (9 cases including
  4 parametrized): no-cap baseline, balanced 2-profile fan-out,
  pre-existing running counts against cap, invalid cap values
  (0/-1/'abc'/None), capped tasks dispatched on next tick after
  running task completes, DispatchResult field schema.
- Broader kanban suite: 464/464 pass (was 449 baseline; +15 new
  regression tests across both features).

## Credit

#27145 — Jimmy Johansson reported the dispatcher skipped-unassigned
gap; @agarzon scoped the simpler 'honor kanban.default_assignee' fix
that matches the existing config knob.
#21582 — @edwardchenchen filed the per-profile cap ask after hitting
model 429s on fan-out research projects; @simlu confirmed the same
pain on local-model setups.
2026-05-28 19:02:55 -07:00
Teknium
769ee86cd2
feat(kanban): attach images referenced in task bodies to worker vision (#34210)
Kanban workers now scan the task body for local image paths and
http(s) image URLs and attach them to the worker's first user turn —
matching the CLI/gateway behaviour for inbound images. Before, a
user pasting `/home/me/screenshot.png` or `https://example.com/img.png`
into a kanban task description had it sent to the model as plain
text and the pixels were never seen.

How it works:
* agent/image_routing.py gains extract_image_refs(text) → (paths, urls)
  that mirrors gateway/platforms/base.py:extract_local_files (absolute /
  ~-relative paths, image extensions only, ignores fenced/inline code).
* build_native_content_parts() accepts an optional image_urls= kwarg
  and emits passthrough image_url parts for remote URLs alongside the
  base64 data: URLs used for local paths.
* cli.py (single-query/quiet branch — the path every dispatcher-spawned
  worker takes) detects HERMES_KANBAN_TASK, reads the task body via
  kanban_db.get_task, runs extract_image_refs, and threads the results
  into the existing image-routing decision (native vs text). Best-effort:
  enrichment failures never block worker startup.

Tested:
* tests/agent/test_image_routing.py — 22 new tests for extract_image_refs
  and URL pass-through in build_native_content_parts.
* tests/hermes_cli/test_kanban_worker_image_extraction.py — 10 new tests
  driving real kanban_db round-trip (create task → read body → extract
  refs → build parts).
* E2E: created a fake kanban task with a body referencing both a local
  PNG and an https URL; verified the worker pipeline produces a
  multimodal user turn with 1 text part + 2 image_url parts (data URL
  for the local file, passthrough URL for the remote).
2026-05-28 17:50:42 -07:00
teknium1
f30db14ced fix(kanban): SIGTERM on worker must terminate the process (#28181)
The single-query signal handler in cli.py raises KeyboardInterrupt on
SIGTERM/SIGHUP. For interactive 'hermes chat -q' that unwinds the main
thread cleanly. For kanban workers spawned by the dispatcher, the
worker process is likely to have a non-daemon thread alive (terminal
_wait_for_process, custom plugins, etc.). With KeyboardInterrupt only
the main thread unwinds; the non-daemon thread keeps the process alive,
the gateway has already restarted, and the dispatcher's _pid_alive
check returns True forever — task stuck in 'running' indefinitely.

When HERMES_KANBAN_TASK is set (dispatcher-spawned worker), flush
logging + stdout/stderr, then os._exit(0) instead of raising
KeyboardInterrupt. The kernel reclaims the PID immediately, and the
existing zombie-state detection in _pid_alive flips the task to
crashed on the next dispatcher tick. detect_crashed_workers then
re-spawns it on the following tick — no manual recovery needed.

A SIGALRM(2s) deadman is armed before the flush so a pathological
blocking-I/O flush can't wedge the worker forever. In practice the
reporter measured flush in <1ms; the alarm is a failsafe, never
the common path.

Interactive (non-kanban) chat -q is unchanged — the env-gated branch
only fires for dispatcher-spawned workers.

Live verification on this machine:
- Without HERMES_KANBAN_TASK + non-daemon thread alive: process hangs
  alive 4+ seconds after SIGTERM. Dispatcher's _pid_alive returns
  True → task stuck.
- With HERMES_KANBAN_TASK + same non-daemon thread: process exits in
  0.10s via os._exit(0). Dispatcher reclaims on next tick.

Tests:
- tests/hermes_cli/test_signal_handler_kanban_worker.py (3 cases):
  end-to-end subprocess test with a non-daemon thread,
  HERMES_KANBAN_TASK env, SIGTERM, dispatcher-style _pid_alive check.
  Plus a source-level invariant test catching future refactors that
  drop the env-gated exit.
- 452/452 kanban tests pass.

Co-authored-by: andrewhosf <andrewho.sf@gmail.com>
2026-05-28 11:59:58 -07:00
Teknium
3a9bc9d88a
fix(model picker): unify /model and hermes model lists, add disk cache (#33867)
* fix(model picker): unify /model and `hermes model` model lists, add disk cache

The /model slash picker and `hermes model` were drifting apart. /model
read the raw static `OPENROUTER_MODELS` list (31 entries, including 5
that fail at runtime — no tool-call support or absent from live catalog),
while `hermes model` ran the same list through the live OpenRouter
/v1/models tool-support filter and showed 26 valid entries. Same problem
existed for every other authed provider: /model used curated static
lists, `hermes model` used live /v1/models.

Unifies both surfaces on `provider_model_ids()` and adds a generic
disk-cached wrapper so the picker stays snappy.

Changes
- hermes_cli/models.py: new `cached_provider_model_ids()` —
  ~/.hermes/provider_models_cache.json, 1h TTL, per-provider entries
  keyed by credential fingerprint (env vars + OAuth file mtimes).
  Stale-data-beats-no-data on transient failures. Pair with
  `clear_provider_models_cache(provider=None)`.
- hermes_cli/models.py: `provider_model_ids("nous")` now falls back
  to the docs-hosted manifest (not the in-repo snapshot) when the live
  Portal /models call fails — preserves the model_catalog regression
  guarantee while still going through the unified pathway.
- hermes_cli/model_switch.py: `list_authenticated_providers` routes
  sections 1, 2, and 2b through `cached_provider_model_ids(slug)` with
  curated fallback when the live fetcher comes up empty.
- hermes_cli/model_switch.py: `parse_model_flags` extended to a
  4-tuple, parses `--refresh`.
- cli.py / gateway/run.py / tui_gateway/server.py: updated unpacking;
  CLI + gateway wire `--refresh` to `clear_provider_models_cache()`.
- hermes_cli/main.py: `hermes model --refresh` argparse flag.
- hermes_cli/commands.py: `/model` args_hint advertises `--refresh`.
- tests/hermes_cli/test_inventory.py: refresh stale comment.

Live PTY parity verification
- /model → OpenRouter row: `(26 models)` (was 31, with broken entries)
- `hermes model` → OpenRouter: 26 models (unchanged)
- The 5 dropped entries: `pareto-code` (no tool-call support),
  `gemini-3-pro-image-preview` (no tool-call support),
  `elephant-alpha`, `hy3-preview:free`, `ring-2.6-1t:free` (gone
  from OpenRouter's live catalog).

Live PTY timing
- First /model open, empty cache: 4624 ms (full network round trip
  across every authed provider)
- Second /model open, warm cache: 51 ms (90× faster)
- `/model --refresh` clears the disk cache and re-fetches.

Cache schema (~/.hermes/provider_models_cache.json, ~3 KB):
  { "anthropic": {"fp": "<sha256:16>", "at": 1748..., "models": [...]},
    ... }

Targeted tests: tests/hermes_cli/ + gateway model tests + tui_gateway —
5855/5855 pass.

* fix(model picker): use blake2b for cache fingerprint to silence CodeQL

py/weak-sensitive-data-hashing flagged the sha256 call in
_credential_fingerprint() as a high-severity alert because the input
includes env var values whose names contain *_API_KEY / *_TOKEN.

The hash is used solely as a cache-bust identity — never reversed, never
stored, collisions are harmless (worst case: cache miss → live re-fetch).
blake2b serves the same purpose and isn't flagged by this rule.

Functional behavior identical: 16-hex-char digest, cache hit/miss logic
unchanged. Live re-verified — 26 OpenRouter models, warm-cache 78ms.
2026-05-28 11:33:16 -07:00
kshitij
a82c88bac0
fix(xai-oauth): accept bare-code manual paste (state=None) (#26923) (#33880)
xAI's consent page renders the authorization code in-page rather than
redirecting through the 127.0.0.1 callback, so on remote/headless setups
(GCP Cloud Shell, Codespaces, container consoles, headless VPS) the only
value the user can paste is the opaque code with no `code=`/`state=`
query parameters. `_parse_pasted_callback` correctly returns
`state=None` for that input, but `_xai_oauth_loopback_login` then
validated state unconditionally and raised `xai_state_mismatch`,
making the documented bare-code paste path unreachable.

PKCE (code_verifier) still binds the token exchange to this client,
so the local state-equality check is redundant when there is no state
to compare. On the manual-paste path only, substitute the locally
generated state when the callback returned none — the rest of the
validation chain (code presence, error field, token exchange) is
unchanged. The loopback HTTP-server path still requires a matching
state (a real browser redirect always carries one).

Also: clarify the manual-paste prompt to mention xAI's in-page code
rendering so users know pasting the bare code on its own is expected.

Root-cause analysis from #26923 comment by @AccursedGalaxy (2026-05-20).

Tests
-----
* test_xai_loopback_login_manual_paste_bare_code_succeeds — positive
  end-to-end through the token exchange with state=None.
* test_xai_loopback_login_loopback_path_rejects_missing_state — the
  HTTP-server path still rejects state=None as a regression guard
  (the bare-code relaxation must NOT widen the loopback path).
* Existing test_xai_loopback_login_manual_paste_state_mismatch_raises
  continues to verify wrong (non-None) state is rejected on manual-paste.

Closes #26923.
2026-05-28 05:47:30 -07:00
Teknium
e0572a6def
fix(skills-hub): stop ellipsis-truncating the Identifier column (#33810)
`hermes skills search` rendered the Identifier column with the default
overflow behaviour, so long slugs (notably browse-sh — every browse-sh
skill ends in a `-XXXXXX` hash that's part of the identifier) were cut
to `browse-sh/weathe…`. Users copied the visible string into
`hermes skills install` and got a not-found error because the hash was
gone.

Set overflow="fold" on the Identifier column in both search tables
(`do_search` and the `_resolve_short_name` multi-match table) so long
slugs wrap onto a second line instead of getting eaten. Also add a
`--json` flag to `hermes skills search` (and the `/skills search`
slash variant) for scripting — emits a list of {name, identifier,
source, trust_level, description} objects with the full identifier,
which is the right shape for copy-paste pipelines too.

Closes #33674.
2026-05-28 04:53:13 -07:00
teknium1
6f9182cb34 fix(kanban): content-addressed corrupt-DB backup filename
Repeated quarantines of an unchanged corrupt kanban.db used to amplify
disk usage by N: the gateway dispatcher's 5-minute retry loop, multi-
profile fleets sharing one DB, and manual reopen attempts each produced
a fresh '.corrupt.<timestamp>.bak' copy of the same bytes. After 10
retries on a 100KB DB you had 11x the disk footprint of duplicate
corrupt data.

Derive the backup filename from a sha256 of the main DB instead of a
timestamp + collision counter. Same bytes → same filename → skip the
copy on retries. Different bytes (partial repair, further damage) →
different filename → preserve separately. Sidecar (-wal/-shm) backups
inherit the same content-addressed name.

Inspired by @hanzckernel's PR #33529, simplified down to ~30 LOC: drop
the persistent JSON marker file, drop the atomic temp+fsync+rename
helper (shutil.copy2 is fine for a quarantine-only path), drop the
gateway-side WAL/SHM fingerprint extension (the existing
(path, mtime, size) tuple still gives the 5-minute retry semantics it
needs), and drop the gateway-side helper extraction. The backup file
existing IS the marker; no separate state needed.

Test: tests/hermes_cli/test_kanban_db.py::test_repeated_corrupt_open_reuses_single_backup
proves 10 retries on the same corrupt bytes produce 1 backup (was 11),
and mutating the corrupt bytes produces a second backup with a
different fingerprint.

Refs #33529
Co-authored-by: hanzckernel <zhicheng.han@mathematik.uni-goettingen.de>
2026-05-28 03:38:09 -07:00
Teknium
432a691758
fix(update): stream + idle-kill npm run build so a stalled webui-build can't soft-brick the install (#33803)
`hermes update` ran the webui build with `capture_output=True` and no timeout. On low-memory hosts (WSL2's 4 GB default, small VPSes, antivirus stalls) Vite goes silent for minutes; users see a frozen terminal, decide the update is hung, and reboot. The reboot lands *after* `pip install -e .` has already touched the install but *before* the build completes, leaving the `hermes` launcher in place while `hermes_cli` is no longer importable — i.e. `ModuleNotFoundError: No module named 'hermes_cli'` (#33788, same class as #32384).

Changes:

- New `_run_with_idle_timeout()` helper: streams subprocess output line-by-line (so the user sees Vite progress in real time) and kills the process if no bytes appear on stdout/stderr for 180s. The existing stale-dist fallback (#23817) then serves the previous build instead of failing the update.
- `_build_web_ui()` uses the helper for `npm run build` (the actual stall site). `npm install` keeps `subprocess.run` + capture_output to preserve the existing EPERM-retry-on-Windows contract.
- Both `cmd_update` call sites print `→ Core update complete. Building dashboard (optional)...` before the webui build. The CLI is fully functional at this point; a webui-build failure only affects `hermes dashboard`. Telegraphing the boundary explicitly stops users from rebooting through the build step.

Tests:

- `tests/hermes_cli/test_run_with_idle_timeout.py` — 4 tests covering streaming success, nonzero exit, idle-kill, and missing-binary cases. Uses real `subprocess.Popen` on tiny Python scripts; isolated in its own file so per-file canonical-runner parallelism doesn't pair it with the mock-heavy tests.
- `tests/hermes_cli/test_web_ui_build.py` — updated existing tests to patch `_run_with_idle_timeout` for the build step in addition to `subprocess.run` for the install step.
- `tests/hermes_cli/test_cmd_update.py::test_update_refreshes_repo_and_tui_node_dependencies` — same update.

Full suite: `scripts/run_tests.sh tests/hermes_cli/` → 5646 passed, 0 failed.

Fixes #33788.
2026-05-28 03:34:47 -07:00
Teknium
10ee4a729b
fix(gateway): drain on Windows hermes gateway stop so sessions survive restart (#33798)
Sessions now survive `hermes gateway stop` / `restart` on native Windows.
Previously the gateway died on schtasks `/End` + os.kill SIGTERM without
ever running the drain loop, so the v0.13.0 session-resume feature (#21192)
silently broke on Windows: `resume_pending=True` was never written, and
the next boot started with a blank conversation history (issue #33778).

Root cause is twofold and the reporter only identified half of it:

1. `hermes_cli/gateway_windows.py::stop()` did not write the
   `planned_stop_marker` before signalling. The reporter caught this.

2. The bigger reason: `asyncio.add_signal_handler` raises
   NotImplementedError for SIGTERM/SIGINT on Windows, so even if the
   marker had been written, the gateway's existing SIGTERM handler
   (which is what calls `runner.stop()` and the `mark_resume_pending`
   loop) was never invoked. Writing the marker would have been
   necessary-but-insufficient.

The fix has two parts:

* gateway/run.py: new `_run_planned_stop_watcher` daemon thread polls
  for the planned-stop marker file every 0.5s. When the marker appears
  it `loop.call_soon_threadsafe(shutdown_signal_handler, None)` — the
  same shutdown path a real SIGTERM would have driven, including the
  pre-drain `mark_resume_pending` writes (run.py:5977) and graceful
  drain wait. The existing signal handler already accepts
  `received_signal=None` and falls through to
  `consume_planned_stop_marker_for_self()`, so no handler changes
  needed. Runs on every platform as cheap belt-and-suspenders.

* hermes_cli/gateway_windows.py: `stop()` now writes the marker for
  the running gateway PID and waits up to `agent.restart_drain_timeout`
  (default 30s) for the PID to exit cleanly. On clean drain, the kill
  sweep is non-forceful; on timeout, escalates to
  `kill_gateway_processes(force=True)` which routes to taskkill /T /F
  per `references/windows-native-support.md`.

Validation:

* 7 new tests in tests/gateway/test_planned_stop_watcher.py covering:
  marker→handler dispatch, no-marker idle, already-draining skip,
  not-yet-running skip, stop_event responsiveness, fire-once
  semantics, error tolerance.
* 8 new tests in tests/hermes_cli/test_gateway_windows.py covering:
  marker-before-kill ordering, clean-drain skips force-kill,
  drain-timeout escalates to force=True, no-pid-skips-drain,
  invalid-pid handling, fast-exit success, timeout failure,
  marker-write-failure tolerance.
* E2E (Linux, detached orphan): write_planned_stop_marker(pid) +
  `_drain_gateway_pid(pid, 5.0)` returns True in 0.5s after the
  victim sees the marker and exits. Tested with a double-forked
  subprocess so the test parent isn't holding it as a zombie.
* Targeted: tests/gateway/{restart_drain,restart_resume_pending,
  signal,signal_format,status,shutdown_forensics,approve_deny_commands,
  planned_stop_watcher} + tests/hermes_cli/{gateway_windows,
  gateway_service} → 519/519.

What was wrong with the reporter's claim (for future archaeology): they
described the symptom as "no `resume_pending=True` written to
`sessions.json`" — but Hermes uses `state.db` (SQLite), not
`sessions.json`, and `mark_resume_pending` is called regardless of
the marker (the marker only affects exit code 0 vs 1 for systemd
revival semantics). The real session-loss path is the missing drain
on Windows, not a missing marker. Both halves are fixed here.

Closes #33778.
2026-05-28 03:25:32 -07:00
teknium1
c7f7783e5c test(xai-proxy): regression coverage for #28932 429 handling
Three new tests in tests/hermes_cli/test_proxy.py:

- xai_adapter_retry_rotates_pool_entry_on_429 — headline #28932 case.
  Two-entry pool, 429 on first entry, must rotate to second entry
  AND must NOT call refresh_xai_oauth_pure (refresh is irrelevant
  for rate limits).
- xai_adapter_retry_returns_none_on_429_when_pool_exhausted —
  single-entry pool: 429 returns None so the rate-limit response
  flows back to the client unchanged (existing behavior preserved).
- xai_adapter_retry_returns_none_for_unrelated_status — non-{401,
  429} statuses must not trigger any retry path at all; guards
  against the gate becoming too broad in future changes.

Each test asserts that refresh_xai_oauth_pure is never called on the
429 path — refresh is a 401-specific concern.

39/39 in tests/hermes_cli/test_proxy.py.
2026-05-28 02:36:37 -07:00
Dusk1e
aa3466063b fix(android): reject unsafe tar members in psutil compatibility installer 2026-05-28 02:36:09 -07:00
Teknium
09a5cd8084
fix(auth): sync manual:device_code Codex pool entries on re-auth (#33744)
#33164 made _save_codex_tokens sync the singleton-seeded `device_code`
pool entry on Codex OAuth re-auth. That fixed the #33000 path but missed
`manual:device_code` entries created by `hermes auth add openai-codex`
(the recommended workaround for users who hit #33000 before #33164
landed).

Every subsequent re-auth would refresh the device_code entry but leave
the manual:device_code entry holding the consumed refresh token plus
stale last_error_* markers — immediately recreating the 401
token_invalidated symptom on the next request, exactly as reported in
#33538.

Extend the refreshable source set to include `manual:device_code`.
Completing the device-code OAuth flow proves the user owns the ChatGPT
account, so it is safe to refresh every device-code-backed entry. Keep
`manual:api_key` and other non-device-code manual sources untouched —
those represent independent credentials.

Closes #33538.
2026-05-28 01:33:10 -07:00
Stephen Schoettler
8595281f3c fix: expose context engine tools with saved toolsets 2026-05-28 00:28:42 -07:00
LeonSGP43
442a9203c0 Fix xAI OAuth timeout manual fallback 2026-05-28 00:24:17 -07:00
Robin Fernandes
1cf5e639b3 fix(auth): refresh Nous entitlement in tool menus 2026-05-28 00:19:31 -07:00
Robin Fernandes
406901b27d feat(auth) normalise the way in which we check whether a user has free/paid access to nous portal so we can expose behaviour and error messages accordingly. 2026-05-28 00:19:31 -07:00
Squiddy
3ba8962738 fix(kanban): add Windows init lock guard 2026-05-27 23:28:51 -07:00
Squiddy
90b6b3d18f fix(kanban): harden sqlite connection concurrency 2026-05-27 23:28:51 -07:00
Ben
875d930ac7 test(docker-update): stub subprocess.run in git-install regression guard
The regression-guard test
`test_cmd_update_on_git_install_does_not_print_docker_message` mocked
`is_managed` and `detect_install_method` but not `subprocess.run`, so
once `cmd_update(check=True)` decided this was a git install it shelled
out to a real `git fetch upstream` / `git fetch origin`. On CI runners
the worktree has no `upstream` remote configured and the fetch hung
past the 30s pytest-timeout — test (4) slice failed in #33659 CI.

Fix: stub `subprocess.run` with a successful CompletedProcess-shaped
object whose stdout is `"0\n"`, so:
  - no real git command is ever invoked
  - the rev-list parsing later in the flow (`int(stdout.strip())`)
    succeeds rather than `ValueError`-ing through the test's
    SystemExit catch
  - the flow proceeds far enough to confirm the docker banner is
    absent (the actual assertion)

Also broaden the except clause to `(SystemExit, Exception)`: the only
assertion in this test is the negative-banner check on captured stdout;
any further failure in the rest of the update flow is irrelevant to
that contract.

Verified locally: all 7 tests in
`tests/hermes_cli/test_cmd_update_docker.py` pass in 0.39s (previously
the regression-guard test alone consumed 30s+ and got SIGTERM'd).
2026-05-28 15:50:25 +10:00
Ben
b924b22a9d fix(docker): hermes update prints docker pull guidance instead of bogus git error
Inside the published Docker image, `hermes update` was hitting the
".git missing → reinstall via curl" fallback:

    ✗ Not a git repository. Please reinstall:
      curl -fsSL https://raw.githubusercontent.com/.../install.sh | bash

That message is wrong on two counts:
  1. It tells the user to run the host-side installer, which would
     install a *new* Hermes on the host — not update the running
     container.
  2. It doesn't mention `docker pull` at all, leaving Docker users
     to figure out the right action from scratch.

`hermes update --check` was worse: it bailed with "Not a git
repository — cannot check for updates." and nothing else.

Fix: detect the Docker install method (already stamped by
`docker/stage2-hook.sh` and surfaced by `detect_install_method()`)
in both update entry points and print a long-form message that
covers:

  - The right command: `docker pull nousresearch/hermes-agent:latest`
  - Restart guidance (`docker compose up -d --force-recreate` /
    re-run `docker run`)
  - How to verify the new version after restart
  - Tag-pinning caveat (`:latest` doesn't move a pinned tag)
  - Config persistence across upgrades (state under `HERMES_HOME` /
    `/opt/data` is bind-mounted and survives)
  - Fork escape hatch (build your own image with the repo's Dockerfile)

Exit code is 1 (matches `managed_error` semantic for "tried to
update but can't update this way").

Plumbing:
  - hermes_cli/config.py: new `format_docker_update_message()` helper
    sits next to the existing `_NIX_UPDATE_MSG` /
    `format_managed_message()` family so the wording lives in one
    place and both call sites (apply path + check path) consume it.
  - hermes_cli/main.py:
      * `cmd_update()`: bail right after the `is_managed()` gate, before
        any of the apply-path branches.
      * `_cmd_update_check()`: bail at the top of the function, before
        the existing `method == "pip"` branch.
    Neither path touches subprocess.run / git when method == "docker".

Coverage:
  - 7 new tests in `tests/hermes_cli/test_cmd_update_docker.py`:
      * `hermes update` in Docker → message + exit 1, no git calls
      * `hermes update --check` (via cmd_update) → same
      * `--yes` / `--force` don't bypass (intentional)
      * `_cmd_update_check` called directly → bails too
      * git/pip installs still take their normal paths (regression guards)
      * `format_docker_update_message` content-lock test pinning the
        five user-actionable bits the message must contain
  - Existing test_cmd_update.py (21 tests) + test_managed_installs.py
    (5 tests) still pass — no regression on the source-install path.
  - Verified end-to-end in a real container: `docker run ... update`
    and `docker run ... update --check` both render the message and
    exit 1.
2026-05-28 15:50:25 +10:00
Ben
66489f38c7 fix(docker): bake build-time git SHA into the image
`hermes dump` and the startup banner both call `git rev-parse HEAD` to
report the running commit, but `.dockerignore` line 2 excludes `.git` —
so inside the published image `hermes dump` shows
`version: ... [(unknown)]` and the banner drops its `· upstream <sha>`
suffix entirely.  That makes support triage from container bug reports
impossible: we can't tell which commit the user is actually running.

Fix: thread the build-time SHA through as a Docker build-arg, write it
to `/opt/hermes/.hermes_build_sha` in the image, and have a new
`hermes_cli/build_info.get_build_sha()` read it as a fallback after the
existing live-git lookup fails.  Output format is unchanged in both
callsites — same 8-char short SHA whether resolved live or baked.

Wiring:
  - Dockerfile: `ARG HERMES_GIT_SHA=` + write-file step after the source
    copy.  Empty/missing arg → no file written → callers fall through to
    live git (so local `docker build` without --build-arg is unchanged).
  - docker-publish.yml: passes `HERMES_GIT_SHA=${{ github.sha }}` on all
    four build-push-action steps (amd64/arm64, smoke-test + final push).
  - dump.py:_get_git_commit() / banner.py:get_git_banner_state(): try
    live git first, fall back to baked SHA, then to legacy `(unknown)`
    / None.  Banner returns `upstream == local, ahead=0` because a built
    image is by definition pinned to one commit.

Coverage:
  - Unit tests cover build_info (file present/absent/empty/error,
    truncation, whitespace), dump (live-git wins, both fallbacks,
    identical output-format regression guard), and banner (no-repo +
    baked, no-repo + no-sha, shallow-clone fallback).
  - tests/docker/test_dump_build_sha.py is an integration regression
    guard that runs against the real image, reads
    `/opt/hermes/.hermes_build_sha`, and asserts `hermes dump` surfaces
    its content (or stays at `(unknown)` if no file).
  - Verified end-to-end: `docker build --build-arg HERMES_GIT_SHA=abc...`
    → `docker run ... dump` reports `[abc12345]`; without the build-arg
    it reports `[(unknown)]` as before.
2026-05-28 15:14:05 +10:00
teknium1
ebe04c66cd fix(kanban): close kanban.db FD after every connect() in long-lived processes
`sqlite3.Connection.__exit__` commits/rollbacks but does NOT close the
underlying FD. `with kb.connect() as conn:` in long-lived processes
(gateway `run_slash`, dashboard `decompose_task_endpoint`) therefore
leaks one FD to `kanban.db` per call. After enough operations the
gateway dies with `[Errno 24] Too many open files` (~4 days uptime
in the production report — #33159).

Fix: add a `connect_closing()` context manager in `hermes_cli/kanban_db`
that wraps `connect()` with a real `try/finally: conn.close()`. Switch
the 42 leak-prone call sites in `hermes_cli/kanban.py` (35),
`hermes_cli/kanban_decompose.py` (4), and `hermes_cli/kanban_specify.py`
(3) over to it.

`kanban.py` matters because `run_slash` (called from the gateway for
every `/kanban` slash command) parses argparse and dispatches to those
`_cmd_*` functions in-process — each one was leaking one FD per
invocation.

Tests inside `tests/` are untouched: short-lived processes where OS
cleanup masks the leak. Regression tests added in
`test_kanban_db.py` cover both happy-path and exception-path closure,
plus an explicit assertion that bare `with kb.connect()` still does
NOT close (documenting the upstream sqlite3 behaviour we're working
around).

Closes #33159.
2026-05-27 22:07:49 -07:00
Dusk
c341a2d107
fix(docker): align HOME for dashboard and s6 gateway services (#33481) 2026-05-28 13:42:27 +10:00
Ben Barclay
b345323195 fix(docker): tee supervised gateway stdout to docker logs
Follow-up to #33583 (the gateway-run-supervised redirect).

Before this fix, the supervised gateway's stdout (most visibly the
"Hermes Gateway Starting…" rich-console banner) was swallowed by
`s6-log` into the rotated file at
`${HERMES_HOME}/logs/gateways/<profile>/current` and never reached
`docker logs`. Operational signal lived in two places:

  * **docker logs** — saw stderr (Python `logging` defaults to
    stderr), so warnings/errors were visible.
  * **the rotated file** — saw stdout (rich banners, `print()`
    output, third-party libs that wrote to fd 1).

This was surprising for users coming from the pre-s6 image, where
`docker run … gateway run` produced a single unified stream in
`docker logs`. They'd see partial output, conclude something was
broken, and dig around for the missing pieces.

Fix: add the `1` s6-log action directive before the file destination
so each line is forwarded to s6-log's stdout — which propagates up
the s6-supervise pipeline to /init's stdout = container stdout =
`docker logs`. The file destination is preserved as a second
destination, so the rotated log (with ISO 8601 timestamps) still
exists for `hermes logs` and for survival across container restarts.

Trade-off considered: timestamps. Putting `T` between `1` and the
file destination (not before `1`) means:

  * docker logs sees raw lines — Python's logging formatter has its
    own timestamps, and `docker logs --timestamps` adds another
    layer when desired. No double-stamping in the common reading
    path.
  * The persisted file gets s6-log's ISO 8601 timestamp so even
    output that lacked a Python-logger timestamp (rich banners,
    third-party raw prints) is correlatable in `current`.

Verification:

  * New unit-test assertion in `test_service_manager.py` locks the
    `s6-log 1` directive into the rendered run-script. Mutation-
    tested by reverting to the pre-fix script (no `1`); the assert
    catches it cleanly.
  * New docker-harness test `test_supervised_gateway_stdout_reaches_docker_logs`
    builds the image, runs `docker run … gateway run`, and asserts
    the unique `⚕` banner glyph reaches `docker logs`. Also verifies
    the rotated file still contains the banner (no regression on
    the existing file destination). Mutation-tested end-to-end: built
    a deliberately-broken image without the `1` directive and the
    test failed exactly as designed, citing the banner present in
    `current` but absent from `docker logs`.
  * `website/docs/user-guide/docker.md` gains a new `:::note Where
    gateway logs go` admonition documenting both destinations and
    the audit-log file at `${HERMES_HOME}/logs/container-boot.log`.

Existing functionality preserved: every other docker-harness test
still passes against the new image. Unit-test sweep across
`tests/hermes_cli/` (5561 tests) is green.
2026-05-28 13:18:41 +10:00
copilot-swe-agent[bot]
19419a47d7
Merge origin/main into bb/gui 2026-05-28 03:08:24 +00:00