mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00
Hermes Agent identified and patched its own prompting blind spots through
automated self-evaluation — running 64+ tool-use benchmarks across GPT-5.4
and Codex-5.3, diagnosing 5 failure modes, writing targeted prompt patches,
and verifying the fix in a closed loop.
Failure modes discovered and fixed:
- Mental arithmetic (wrong answers: 39,152,053 vs correct 39,151,253)
- User profile hallucination ('Windows 11' when running on Linux)
- Time guessing without verification
- Clarification-seeking instead of acting ('open where?' for port checks)
- Hash computation from memory (SHA-256, encodings)
- Confusing system RAM with agent's own persistent memory store
Two new XML sections added to OPENAI_MODEL_EXECUTION_GUIDANCE:
- <mandatory_tool_use>: explicit categories that must always use tools
- <act_dont_ask>: default to action on obvious interpretations
Results:
gpt-5.4: 68.8% → 100% tool compliance (+31.2pp)
gpt-5.3-codex: 62.5% → 100% tool compliance (+37.5pp)
Regression: 0/8 conversational prompts over-tooled
|
||
|---|---|---|
| .. | ||
| __init__.py | ||
| anthropic_adapter.py | ||
| auxiliary_client.py | ||
| builtin_memory_provider.py | ||
| context_compressor.py | ||
| context_references.py | ||
| copilot_acp_client.py | ||
| credential_pool.py | ||
| display.py | ||
| insights.py | ||
| memory_manager.py | ||
| memory_provider.py | ||
| model_metadata.py | ||
| models_dev.py | ||
| prompt_builder.py | ||
| prompt_caching.py | ||
| redact.py | ||
| retry_utils.py | ||
| skill_commands.py | ||
| skill_utils.py | ||
| smart_model_routing.py | ||
| subdirectory_hints.py | ||
| title_generator.py | ||
| trajectory.py | ||
| usage_pricing.py | ||