hermes-agent/docs/rca-ssl-cacert-post-git-pull.md
chromalinx a218a0f156 fix(agent,gateway,doctor): add SSL CA cert bundle fail-fast guard
A stale certifi CA bundle after a partial `hermes update` used to crash
the agent on the first outbound HTTPS call with a raw traceback and
trap the gateway in a retry loop.

This patch:

* Adds `agent/errors.py` with a typed `SSLConfigurationError`
* Adds `agent/ssl_guard.py` with a `verify_ca_bundle()` pre-flight
  that asserts the bundle exists, is non-trivial in size, and can build
  a working SSLContext. On macOS, it falls back to the system trust
  store when the bundle is empty but the system store is healthy
  (covers corporate proxies / MDM setups).
* Wires the guard into `run_agent.py` and `gateway/run.py` right
  after the `hermes_bootstrap` import, inside a try/except so a bug
  in the guard itself can never prevent startup.
* Adds a `SSL / CA Certificates` section to `hermes_cli doctor` so
  users can detect the failure with one command.
* Adds unit tests covering the healthy, missing, empty, skip-env, and
  macOS-fallback paths.
* Adds an RCA document describing the failure mode and the recovery
  path (`pip install -e .`).

When the bundle is broken the user sees:

    \u26a0\ufe0f SSL certificate bundle issue detected.
       Run: pip install -e .

`HERMES_SKIP_SSL_GUARD=1` disables the check for sandboxed
environments that ship their own trust store.
2026-06-13 21:14:32 -07:00

47 lines
2.3 KiB
Markdown

# RCA: SSL CA cert bundle corruption after `hermes update`
**Status:** resolved by `fix(agent,gateway): add SSL CA cert bundle fail-fast guard`
**Severity:** P2 — degrades the agent into a crash-loop until the user re-installs deps.
## Summary
A `git pull` (or `hermes update`) that lands new code without finishing `uv pip install -e .` leaves the certifi CA bundle stale or missing on disk. The first outbound HTTPS call (OpenAI, Telegram, Discord, etc.) then crashes with a raw `ssl.SSLCertVerificationError` and Hermes enters a crash-loop, surfacing only a traceback to the user.
## Root cause
`certifi.where()` returns the path to the CA bundle shipped by the `certifi` package inside the active venv. When the venv is partially refreshed (new `certifi` files copied but old certs in the wheel cache, or a half-deleted install), the bundle can be:
- **missing** (file removed but Python still imports the package),
- **empty / truncated** (partial write),
- **unloadable** (cert format mismatch on a Python upgrade).
Hermes used to let those failures bubble up uncaught, so the gateway would log a stacktrace and the agent would retry the same broken network call on the next turn.
## Fix
`agent/ssl_guard.py` runs a `verify_ca_bundle()` pre-flight right after the `hermes_bootstrap` import in both `run_agent.py` and `gateway/run.py`. It:
1. Resolves the certifi bundle path,
2. Asserts the file exists and is at least 1 KB,
3. Builds an `ssl.SSLContext` from it,
4. Falls back to the system trust store on macOS when the bundle is empty but the system store works (covers corporate proxies / MDM setups),
5. Raises a typed `SSLConfigurationError` with a clear remediation hint otherwise.
`run_agent.py` and `gateway/run.py` import the guard in a guarded `try/except` so a bug in the guard itself cannot prevent startup — we log a warning and continue.
`hermes_cli doctor` now exposes a `SSL / CA Certificates` section so users can detect the failure with a single command.
## Recovery
When the guard fires, the user sees:
```
⚠️ SSL certificate bundle issue detected.
Run: pip install -e .
```
`pip install -e .` (or the equivalent `uv pip install -e .`) reinstalls certifi and restores the bundle.
## Environment escape hatch
Set `HERMES_SKIP_SSL_GUARD=1` to bypass the check. Intended for sandboxed environments that ship their own trust store.