Commit graph

2 commits

Author SHA1 Message Date
Teknium
304fb921bf
fix: two process leaks (agent-browser daemons, paste.rs sleepers) (#11843)
Both fixes close process leaks observed in production (18+ orphaned
agent-browser node daemons, 15+ orphaned paste.rs sleep interpreters
accumulated over ~3 days, ~2.7 GB RSS).

## agent-browser daemon leak

Previously the orphan reaper (_reap_orphaned_browser_sessions) only ran
from _start_browser_cleanup_thread, which is only invoked on the first
browser tool call in a process. Hermes sessions that never used the
browser never swept orphans, and the cross-process orphan detection
relied on in-process _active_sessions, which doesn't see other hermes
PIDs' sessions (race risk).

- Write <session>.owner_pid alongside the socket dir recording the
  hermes PID that owns the daemon (extracted into _write_owner_pid for
  direct testability).
- Reaper prefers owner_pid liveness over in-process _active_sessions.
  Cross-process safe: concurrent hermes instances won't reap each
  other's daemons. Legacy tracked_names fallback kept for daemons
  that predate owner_pid.
- atexit handler (_emergency_cleanup_all_sessions) now always runs
  the reaper, not just when this process had active sessions —
  every clean hermes exit sweeps accumulated orphans.

## paste.rs auto-delete leak

_schedule_auto_delete spawned a detached Python subprocess per call
that slept 6 hours then issued DELETE requests. No dedup, no tracking —
every 'hermes debug share' invocation added ~20 MB of resident Python
interpreters that stuck around until the sleep finished.

- Replaced the spawn with ~/.hermes/pastes/pending.json: records
  {url, expire_at} entries.
- _sweep_expired_pastes() synchronously DELETEs past-due entries on
  every 'hermes debug' invocation (run_debug() dispatcher).
- Network failures stay in pending.json for up to 24h, then give up
  (paste.rs's own retention handles the 'user never runs hermes again'
  edge case).
- Zero subprocesses; regression test asserts subprocess/Popen/time.sleep
  never appear in the function source (skipping docstrings via AST).

## Validation

|                              | Before        | After        |
|------------------------------|---------------|--------------|
| Orphan agent-browser daemons | 18 accumulated| 2 (live)     |
| paste.rs sleep interpreters  | 15 accumulated| 0            |
| RSS reclaimed                | -             | ~2.7 GB      |
| Targeted tests               | -             | 2253 pass    |

E2E verified: alive-owner daemons NOT reaped; dead-owner daemons
SIGTERM'd and socket dirs cleaned; pending.json sweep deletes expired
entries without spawning subprocesses.
2026-04-17 18:46:30 -07:00
Teknium
75380de430
fix: reap orphaned browser sessions on startup (#7931)
When a Python process exits uncleanly (SIGKILL, crash, gateway restart
via hermes update), in-memory _active_sessions tracking is lost but the
agent-browser node daemons and their Chromium child processes keep
running indefinitely. On a long-running system this causes unbounded
memory growth — 24 orphaned sessions consumed 7.6 GB on a production
machine over 9 days.

Add _reap_orphaned_browser_sessions() which scans the tmp directory for
agent-browser-{h_*,cdp_*} socket dirs on cleanup thread startup.  For
each dir not tracked by the current process, reads the daemon PID file
and sends SIGTERM if the daemon is still alive.  Handles edge cases:
dead PIDs, corrupt PID files, permission errors, foreign processes.

The reaper runs once on thread startup (not every 30s) to avoid races
with sessions being actively created by concurrent agents.
2026-04-11 14:02:46 -07:00