External rotation (logrotate, manual `mv gateway.log gateway.log.1`,
another process rotating the file) leaves `_ManagedRotatingFileHandler`'s
open fd pinned to the renamed inode. All subsequent writes go to the
rotated backup instead of the file every operator expects to read,
producing the symptom 'gateway.log frozen mid-write while agent.log
keeps growing with gateway.* records'.
PR #16229 fixed the original CLI->gateway init-order bug (#8404) so the
handler attaches in the first place. This is the sibling fix for what
happens after attach, when something external rotates underneath us.
Adds a WatchedFileHandler-style inode check on emit(): if the file at
baseFilename no longer matches the open stream's (dev, ino), or no longer
exists, close the stale fd and reopen at the expected path. doRollover()
refreshes the snapshot so our own rollover isn't misidentified as external.
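The inode check described above can be sketched roughly as follows. This is a minimal illustration, not the actual handler: the class name InodeCheckingRotatingHandler and the helper _path_matches_snapshot are hypothetical, and the snapshot is refreshed inside _open() on the assumption that every reopen (including the one at the end of doRollover()) goes through it.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


class InodeCheckingRotatingHandler(RotatingFileHandler):
    """Hypothetical stand-in for _ManagedRotatingFileHandler."""

    # Class-level default so the attribute exists even when delay=True
    # and _open() has not run yet.
    _snapshot = None

    def _open(self):
        # Every (re)open records the (dev, ino) of the new stream. The
        # stock doRollover() reopens through _open() when delay is False,
        # so the handler's own rollover refreshes the snapshot as well.
        stream = super()._open()
        st = os.fstat(stream.fileno())
        self._snapshot = (st.st_dev, st.st_ino)
        return stream

    def _path_matches_snapshot(self):
        try:
            st = os.stat(self.baseFilename)
        except FileNotFoundError:
            # Externally unlinked or renamed with no replacement yet.
            return False
        return (st.st_dev, st.st_ino) == self._snapshot

    def emit(self, record):
        # Handler.handle() already holds the handler lock when it calls emit().
        if self.stream is not None and not self._path_matches_snapshot():
            self.stream.close()
            self.stream = self._open()
        super().emit(record)
```

An external truncate keeps the same inode, so the comparison still succeeds and no reopen happens, which matches the truncate case in the test matrix.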
Five regression tests cover the matrix: external rename, external
unlink, external truncate (must NOT trigger reopen — inode unchanged),
normal doRollover() (must still work), and the end-to-end
Allen-reproduction (rotate + re-call setup_logging).
55/55 tests in tests/test_hermes_logging.py pass; 5972/5972 in
tests/gateway/ pass.
Sweep of all CI failures on origin/main, grouped by drift source:
Telegram allowlist gate (db50af910 added user-authz to _should_process_message):
- Hardcoded "[Telegram]" prefix in the logger.warning so the call no
longer dereferences self.name → self.platform, which test fixtures
built via object.__new__ never set.
- test_telegram_format / test_allowed_channels_widening fixtures stub
_is_callback_user_authorized → True so the new gate doesn't reject
guest-mode / allowed-channels test messages.
- test_telegram_approval_buttons::test_update_prompt_callback_not_affected
sets TELEGRAM_ALLOWED_USERS="*" so the fail-closed default doesn't
reject the callback before it writes .update_response.
Approval surface (6d495d9e7 renamed status, 214b95392 detached stdin):
- test_no_callback_returns_approval_required: status is now
"pending_approval" (was "approval_required").
- test_close_stdin_allows_eof_driven_process_to_finish: switch to
use_pty=True; non-PTY now uses stdin=DEVNULL.
Mattermost (send() now resolves root_id via _api_get first):
- test_send_with_thread_reply mocks _session.get with a thread-root
response so the new resolver doesn't TypeError on a bare AsyncMock.
Kanban (d8ad431de rename, f55d94a1e review column, _kanban_worker_skill_available):
- _safe_int → _to_epoch in the two test_kanban_db tests.
- Spawn-skills tests (×3) monkey-patch _kanban_worker_skill_available
to True since the isolated kanban_home fixture has no devops/kanban-worker tree.
- test_gateway_dispatcher_disables_corrupt_board: connect count
3 → 5 (review-column probe now also runs per tick).
Aux-config severity at_or_above (a94ddd807):
- test_diagnostics_endpoint_severity_filter expects warning filter to
include error+critical now (was exact-match).
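As an illustration of the at-or-above semantics, a filter of this kind might look like the sketch below. The severity names, their ordering, and the entry shape (a dict with a "severity" key) are assumptions made for the example, not the endpoint's real data model.

```python
# Assumed severity ordering; the real endpoint may use different names.
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "critical": 3}


def filter_at_or_above(entries, threshold):
    """Keep entries whose severity is at or above the threshold."""
    floor = SEVERITY_RANK[threshold]
    return [e for e in entries if SEVERITY_RANK[e["severity"]] >= floor]
```

With exact-match semantics a "warning" filter would return only warnings; here it also returns error and critical entries.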
Anthropic error handling (conversation loop extracted from run_agent):
- _no_backoff_wait fixture patches BOTH run_agent.jittered_backoff AND
agent.conversation_loop.jittered_backoff. The latter is the actual
call site; without the second patch tests burn ~2s per retry and
hit the 30s SIGALRM timeout on CI.
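Why both patches are needed can be shown with a small self-contained sketch. The two module objects below are hypothetical stand-ins for run_agent and agent.conversation_loop, and the sketch assumes the loop module holds its own binding of the function (as a `from ... import` would create).

```python
import types
from unittest import mock

# Hypothetical stand-in for the module that defines the backoff helper.
source_mod = types.ModuleType("run_agent_demo")
source_mod.jittered_backoff = lambda attempt: 2.0

# Hypothetical stand-in for the module containing the call site. Copying
# the attribute mimics `from run_agent_demo import jittered_backoff`,
# which creates a second, independent binding.
loop_mod = types.ModuleType("conversation_loop_demo")
loop_mod.jittered_backoff = source_mod.jittered_backoff


def retry_delay(attempt):
    # The call site resolves the name through loop_mod, not source_mod.
    return loop_mod.jittered_backoff(attempt)


# Patching only the defining module leaves the call site untouched.
with mock.patch.object(source_mod, "jittered_backoff", return_value=0.0):
    only_source_patched = retry_delay(1)

# Patching both bindings removes the real delay.
with mock.patch.object(source_mod, "jittered_backoff", return_value=0.0), \
        mock.patch.object(loop_mod, "jittered_backoff", return_value=0.0):
    both_patched = retry_delay(1)
```

The same lookup rule explains the get_env_value change listed under "Other test pollution / drift": a patch only takes effect on the binding the code under test actually reads.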
Other test pollution / drift:
- test_auto_does_not_select_copilot_from_github_token: patch
agent.bedrock_adapter.has_aws_credentials → False so boto3's
credential chain can't auto-pick Bedrock from developer ~/.aws.
- test_setup_openclaw_migration: patch hermes_cli.gateway.get_env_value
in addition to setup_mod.get_env_value — _platform_status reads
through the gateway module's binding.
- test_gateway_prefix: COMPONENT_PREFIXES["gateway"] now includes
"hermes_plugins" too.
- test_recommended_update_command_defaults_to_hermes_update: also
short-circuit get_managed_update_command in case a stray
~/.hermes/.managed marker is present.
- test_user_id_is_not_explicit: _parse_target_ref now returns
is_explicit=False for Slack U.../W... IDs (chat.postMessage rejects
them — a DM must be opened first via conversations.open).
* feat: component-separated logging with session context and filtering
Phase 1 — Gateway log isolation:
- gateway.log now only receives records from gateway.* loggers
(platform adapters, session management, slash commands, delivery)
- agent.log remains the catch-all (all components)
- errors.log remains WARNING+ catch-all
- Moved gateway.log handler creation from gateway/run.py into
hermes_logging.setup_logging(mode='gateway') with _ComponentFilter
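A prefix-based filter of the kind Phase 1 describes could be sketched like this. The class name ComponentFilter and the exact matching rule (exact name or dotted child) are assumptions; the real _ComponentFilter may differ.

```python
import logging


class ComponentFilter(logging.Filter):
    """Pass only records whose logger name falls under one of the prefixes."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = tuple(prefixes)

    def filter(self, record):
        # Match "gateway" itself and dotted children such as
        # "gateway.platforms.telegram", but not "gateway_extra".
        return any(
            record.name == prefix or record.name.startswith(prefix + ".")
            for prefix in self.prefixes
        )
```

Attaching such a filter to the gateway.log handler only, and leaving the agent.log and errors.log handlers unfiltered, would give the isolation described above while keeping the catch-all files intact.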
Phase 2 — Session ID injection:
- Added set_session_context(session_id) / clear_session_context() API
using threading.local() for per-thread session tracking
- _SessionFilter enriches every log record with session_tag attribute
- Log format: '2026-04-11 10:23:45 INFO [session_id] logger.name: msg'
- Session context set at start of run_conversation() in run_agent.py
- Thread-isolated: gateway conversations on different threads don't leak
Phase 3 — Component filtering in hermes logs:
- Added --component flag: hermes logs --component gateway|agent|tools|cli|cron
- COMPONENT_PREFIXES maps component names to logger name prefixes
- Works with all existing filters (--level, --session, --since, -f)
- Logger name extraction handles both old and new log formats
Files changed:
- hermes_logging.py: _SessionFilter, _ComponentFilter, COMPONENT_PREFIXES,
set/clear_session_context(), gateway.log creation in setup_logging()
- gateway/run.py: removed redundant gateway.log handler (now in hermes_logging)
- run_agent.py: set_session_context() at start of run_conversation()
- hermes_cli/logs.py: --component filter, logger name extraction
- hermes_cli/main.py: --component argument on logs subparser
Addresses community request for component-separated, filterable logging.
Zero changes to existing logger names — __name__ already provides hierarchy.
* fix: use LogRecord factory instead of per-handler _SessionFilter
The _SessionFilter approach required attaching a filter to every handler
we create. Any handler created outside our _add_rotating_handler (like
the gateway stderr handler, or third-party handlers) would crash with
KeyError: 'session_tag' if it used our format string.
Replace with logging.setLogRecordFactory() which injects session_tag
into every LogRecord at creation time — process-global, zero per-handler
wiring needed. The factory is installed at import time (before
setup_logging) so session_tag is available from the moment hermes_logging
is imported.
- Idempotent: marker attribute prevents double-wrapping on module reload
- Chains with existing factory: won't break third-party record factories
- Removes _SessionFilter from _add_rotating_handler and setup_verbose_logging
- Adds tests: record factory injection, idempotency, arbitrary handler compat
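The record-factory approach, combined with the thread-local session context from Phase 2, might look roughly like the sketch below. The function install_session_record_factory, the marker attribute name, and the "[id]" tag format are assumptions for illustration; only logging.getLogRecordFactory() and logging.setLogRecordFactory() are standard-library calls.

```python
import logging
import threading

_ctx = threading.local()


def set_session_context(session_id):
    _ctx.session_id = session_id


def clear_session_context():
    _ctx.session_id = None


def install_session_record_factory():
    """Wrap the current LogRecord factory so every record gets session_tag."""
    current = logging.getLogRecordFactory()
    # Hypothetical marker attribute: skip if the factory is already wrapped.
    if getattr(current, "_injects_session_tag", False):
        return

    def factory(*args, **kwargs):
        # Chain to whatever factory was installed before, then enrich.
        record = current(*args, **kwargs)
        session_id = getattr(_ctx, "session_id", None)
        record.session_tag = f"[{session_id}]" if session_id else ""
        return record

    factory._injects_session_tag = True
    logging.setLogRecordFactory(factory)
```

Because the attribute is set when the record is created, any handler whose format string references %(session_tag)s can format the record, whether or not it was created through _add_rotating_handler.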
* feat(nix): shared-state permission model for interactive CLI users
Enable interactive CLI users in the hermes group to share full
read-write state (sessions, memories, logs, cron) with the gateway
service via a setgid + group-writable permission model.
Changes:
nix/nixosModules.nix:
- Directories use setgid 2770 (was 0750) so new files inherit the
hermes group. home/ stays 0750 (no interactive write needed).
- Activation script creates HERMES_HOME subdirs (cron, sessions, logs,
memories) — previously Python created them but managed mode now skips
mkdir.
- Activation migrates existing runtime files to group-writable (chmod
g+rw). Nix-managed files (config.yaml, .env, .managed) stay 0640/0644.
- Gateway systemd unit gets UMask=0007 so files it creates are 0660.
hermes_cli/config.py:
- ensure_hermes_home() splits into managed/unmanaged paths. Managed mode
verifies dirs exist (raises RuntimeError if not) instead of creating
them. Scoped umask(0o007) ensures SOUL.md is created as 0660.
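A scoped umask of the kind mentioned above can be sketched as a small context manager. The name scoped_umask is hypothetical, and the sketch assumes a POSIX system. Note that os.umask() is process-wide, so other threads creating files inside the block are affected too.

```python
import contextlib
import os


@contextlib.contextmanager
def scoped_umask(mask):
    """Temporarily set the process umask, restoring the old value on exit."""
    previous = os.umask(mask)
    try:
        yield
    finally:
        os.umask(previous)
```

A file created with the default open() mode (0666) inside `with scoped_umask(0o007):` ends up as 0660, which is the group-writable mode the shared-state model relies on.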
hermes_logging.py:
- _ManagedRotatingFileHandler subclass applies chmod 0660 after log
rotation in managed mode. RotatingFileHandler.doRollover() creates new
files via open() which uses the process umask (0022 → 0644), not the
scoped umask from ensure_hermes_home().
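One way to get group-writable log files regardless of the process umask is to chmod in _open(), which runs both on the initial open and when doRollover() reopens the base file. The sketch below assumes that approach; the class name and the managed keyword argument are hypothetical, and the real handler determines managed mode through is_managed().

```python
import os
from logging.handlers import RotatingFileHandler


class GroupWritableRotatingHandler(RotatingFileHandler):
    """Hypothetical sketch of a handler that fixes file mode in managed mode."""

    def __init__(self, *args, managed=False, **kwargs):
        # Set before super().__init__(), which calls _open() when delay=False.
        self._managed = managed
        super().__init__(*args, **kwargs)

    def _open(self):
        stream = super()._open()
        if self._managed:
            try:
                # open() honours the process umask (0022 gives 0644), so the
                # mode is corrected explicitly after the file exists.
                os.chmod(self.baseFilename, 0o660)
            except OSError:
                # Best effort: logging should not fail over a mode change.
                pass
        return stream
```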
Verified with a 13-subtest NixOS VM integration test covering setgid,
interactive writes, file ownership, migration, and gateway coexistence.
Refs: #6044
* Fix managed log file mode on initial open
Co-authored-by: Siddharth Balyan <alt-glitch@users.noreply.github.com>
* refactor: simplify managed file handler and merge activation loops
- Cache is_managed() result in handler __init__ instead of lazy-importing
on every _open()/_chmod_if_managed() call. Avoids repeated stat+env
checks on log rotation.
- Merge two for-loops over the same subdir list in activation script
into a single loop (mkdir + chown + chmod + find in one pass).
---------
Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Siddharth Balyan <alt-glitch@users.noreply.github.com>
* fix: repair 57 failing CI tests across 14 files
Categories of fixes:
**Test isolation under xdist (-n auto):**
- test_hermes_logging: Strip ALL RotatingFileHandlers before each test
to prevent handlers leaked from other xdist workers from polluting counts
- test_code_execution: Force TERMINAL_ENV=local in setUp — prevents Modal
AuthError when another test leaks TERMINAL_ENV=modal
- test_timezone: Same TERMINAL_ENV fix for execute_code timezone tests
- test_codex_execution_paths: Mock _resolve_turn_agent_config to ensure
model resolution works regardless of xdist worker state
**Matrix adapter tests (nio not installed in CI):**
- Add _make_fake_nio() helper with real response classes for isinstance()
checks in production code
- Replace MagicMock(spec=nio.XxxResponse) with fake_nio instances
- Wrap production method calls with patch.dict('sys.modules', {'nio': ...})
so import nio succeeds in method bodies
- Use try/except instead of pytest.importorskip for nio.crypto imports
(importorskip can be fooled by MagicMock in sys.modules)
- test_matrix_voice: Skip entire file if nio is a mock, not just missing
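The "is it a mock, not just missing" check could be sketched as below. The helper name real_module_available is hypothetical; the idea is that a MagicMock placed in sys.modules makes a plain import succeed, so the test has to inspect what the import actually returned.

```python
import importlib
import sys
from unittest.mock import NonCallableMock


def real_module_available(name):
    """Return True only if `name` imports and is not a mock in sys.modules."""
    # NonCallableMock is the common base of Mock and MagicMock.
    if isinstance(sys.modules.get(name), NonCallableMock):
        return False
    try:
        module = importlib.import_module(name)
    except ImportError:
        return False
    return not isinstance(module, NonCallableMock)
```

A module-level skip could then use something like `pytest.skip(..., allow_module_level=True)` when the helper returns False for "nio".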
**Stale test expectations:**
- test_cli_provider_resolution: _prompt_provider_choice now takes **kwargs
(default param added); mock getpass.getpass alongside input
- test_anthropic_oauth_flow: Mock getpass.getpass (code switched from input)
- test_gemini_provider: Mock models.dev + OpenRouter API lookups to test
hardcoded defaults without external API variance
- test_code_execution: Add notify_on_complete to blocked terminal params
- test_setup_openclaw_migration: Mock prompt_choice to select 'Full setup'
(new quick-setup path leads to _require_tty → sys.exit in CI)
- test_skill_manager_tool: Patch get_all_skills_dirs alongside SKILLS_DIR
so _find_skill searches tmp_path, not real ~/.hermes/skills/
**Missing attributes in object.__new__ test runners:**
- test_platform_reconnect: Add session_store to _make_runner()
- test_session_race_guard: Add hooks, _running_agents_ts, session_store,
delivery_router to _make_runner()
**Production bug fix (gateway/run.py):**
- Fix sentinel eviction race: _AGENT_PENDING_SENTINEL was immediately
evicted by the stale-detection logic because sentinels have no
get_activity_summary() method, causing _stale_idle=inf >= timeout.
Guard _should_evict with 'is not _AGENT_PENDING_SENTINEL'.
* fix: address remaining CI failures
- test_setup_openclaw_migration: Also mock _offer_launch_chat (called at
end of both quick and full setup paths)
- test_code_execution: Move TERMINAL_ENV=local to module level to protect
ALL test classes (TestEnvVarFiltering, TestExecuteCodeEdgeCases,
TestInterruptHandling, TestHeadTailTruncation) from xdist env leaks
- test_matrix: Use try/except for nio.crypto imports (importorskip can be
fooled by MagicMock in sys.modules under xdist)