Extract the islink/realpath guard from the 16743 fix into a single
atomic_replace() helper in utils.py, then migrate every os.replace()
call site in the codebase to use it.
The original PR #16777 correctly identified and fixed the bug, but
only patched 9 of ~24 call sites. The same bug class (managed
deployments that symlink state files silently losing the link on
every write) still existed at auth.json, sessions file, gateway
config, env_loader, webhook subscriptions, debug store, model
catalog, pairing, google OAuth, nous rate guard, and more.
Rather than add another 10+ copies of the same three-line guard,
consolidate into atomic_replace(tmp, target) which:
- resolves symlinks via os.path.realpath before os.replace
- returns the resolved real path so callers can re-apply permissions
- is a drop-in replacement for os.replace at the use sites
Changes:
- utils.py: new atomic_replace() helper + atomic_json_write /
atomic_yaml_write now call it instead of inlining the guard
- 16 files: all os.replace() call sites migrated to atomic_replace()
- agent/{google_oauth, nous_rate_guard, shell_hooks}.py
- cron/jobs.py
- gateway/{pairing, session, platforms/telegram}.py
- hermes_cli/{auth, config, debug, env_loader, model_catalog, webhook}.py
- tools/{memory_tool, skill_manager_tool, skills_sync}.py
Tests: tests/test_atomic_replace_symlinks.py pins the invariant for
atomic_replace + atomic_json_write + atomic_yaml_write, covers plain
files, first-time creates, broken symlinks, and permission preservation.
Refs #16743
Builds on #16777 by @vominh1919.
Load-time sanitizer silently removed non-ASCII codepoints from any
env var ending in _API_KEY / _TOKEN / _SECRET / _KEY, turning
copy-paste artifacts (Unicode lookalikes, ZWSP, NBSP) into opaque
provider-side API_KEY_INVALID errors.
Warn once per key to stderr with the offending codepoints (U+XXXX)
and guidance to re-copy from the provider dashboard.
API keys containing Unicode lookalike characters (e.g. ʋ U+028B instead
of v) cause UnicodeEncodeError when httpx encodes the Authorization
header as ASCII. This commonly happens when users copy-paste keys from
PDFs, rich-text editors, or web pages with decorative fonts.
Three layers of defense:
1. **Save-time validation** (hermes_cli/config.py):
_check_non_ascii_credential() strips non-ASCII from credential values
when saving to .env, with a clear warning explaining the issue.
2. **Load-time sanitization** (hermes_cli/env_loader.py):
_sanitize_loaded_credentials() strips non-ASCII from credential env
vars (those ending in _API_KEY, _TOKEN, _SECRET, _KEY) after dotenv
loads them, so the rest of the codebase never sees non-ASCII keys.
3. **Runtime recovery** (run_agent.py):
The UnicodeEncodeError recovery block now also sanitizes self.api_key
and self._client_kwargs['api_key'], fixing the gap where message/tool
sanitization succeeded but the API key still caused httpx to fail on
the Authorization header.
Also: hermes_logging.py RotatingFileHandler now explicitly sets
encoding='utf-8' instead of relying on locale default (defensive
hardening for ASCII-locale systems).
When .env files become corrupted (e.g. concatenated KEY=VALUE pairs on
a single line due to concurrent writes or encoding issues), both
python-dotenv and load_env() would parse the entire concatenated string
as a single value. This caused bot tokens to appear duplicated up to 8×,
triggering InvalidToken errors from the Telegram API.
Root cause: _sanitize_env_lines() — which correctly splits concatenated
lines — was only called during save_env_value() writes, not during reads.
Fix:
- load_env() now calls _sanitize_env_lines() before parsing
- env_loader.load_hermes_dotenv() sanitizes the .env file on disk
before python-dotenv reads it, so os.getenv() also returns clean values
- Added tests reproducing the exact corruption pattern from #8908Closes#8908
Hermes startup entrypoints now load ~/.hermes/.env and project fallback env files with user config taking precedence over stale shell-exported values. This makes model/provider/base URL changes in .env actually take effect after restarting Hermes. Adds a shared env loader plus regression coverage, and reproduces the original bug case where OPENAI_BASE_URL and HERMES_INFERENCE_PROVIDER remained stuck on old shell values before import.