hermes-agent/environments/benchmarks/yc_bench
Teknium cbce5e93fc codebase: add encoding='utf-8' to all bare open() calls (PLW1514)
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.

Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs).  That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.

After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly.  Works identically on every platform
and every locale, no surprise behavior.
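
An illustrative before/after (the path here is hypothetical):

  # Before: decoded with the platform locale (cp1252 on US-locale
  # Windows); non-ASCII bytes crash or silently mis-decode.
  with open("config.yaml") as f:
      data = f.read()

  # After: explicit encoding, identical on every platform and locale.
  with open("config.yaml", encoding="utf-8") as f:
      data = f.read()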

Mechanical sweep via:
  ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix \
    --exclude 'tests,venv,.venv,node_modules,website,optional-skills,skills,tinker-atropos,plugins' .

All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8').  Nothing
else changed.  Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).

Scope notes:
  - tests/ excluded: test fixtures can use locale encoding intentionally
    (exercising edge cases).  If we want to tighten tests later that's
    a separate PR.
  - plugins/ excluded: plugin-specific conventions may differ; plugin
    authors own their code.
  - optional-skills/ and skills/ excluded: skill scripts are user-authored
    and we don't want to mass-edit them.
  - website/ and tinker-atropos/ excluded: vendored / generated content.

46 files touched, +89/-89 lines (symmetric replacement).  No behavior
change on POSIX, or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII content.
2026-05-08 14:27:40 -07:00
__init__.py feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
default.yaml fix: update OpenRouter model names for yc-bench config 2026-03-06 19:58:56 -08:00
README.md feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
run_eval.sh feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
yc_bench_env.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00

YC-Bench: Long-Horizon Agent Benchmark

YC-Bench by Collinear AI is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.

Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.

Setup

# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"

# Or install from source
git clone https://github.com/collinear-ai/yc-bench
cd yc-bench && pip install -e .

# Verify
yc-bench --help

Running

# From the repo root:
bash environments/benchmarks/yc_bench/run_eval.sh

# Or directly:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
    --config environments/benchmarks/yc_bench/default.yaml

# Override model:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --openai.model_name anthropic/claude-opus-4-20250514

# Quick single-preset test:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --env.presets '["fast_test"]' --env.seeds '[1]'

How It Works

Architecture

HermesAgentLoop (our agent)
  -> terminal tool -> subprocess("yc-bench company status") -> JSON output
  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
  -> ... (100-500 turns per run)

The environment initialises the simulation via yc-bench sim init (NOT yc-bench run, which would start yc-bench's own built-in agent loop). Our HermesAgentLoop then drives all interaction through CLI commands.
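
A minimal sketch of that pattern, assuming a hypothetical run_yc helper (illustrative, not the actual yc_bench_env.py code):

import json
import subprocess

def run_yc(*args: str) -> dict:
    # Invoke the yc-bench CLI and parse the JSON it prints.
    result = subprocess.run(
        ["yc-bench", *args],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

status = run_yc("company", "status")        # inspect the simulated company
run_yc("task", "accept", "--task-id", "X")  # accept a task
run_yc("sim", "resume")                     # advance simulated time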

Simulation Mechanics

  • 4 skill domains: research, inference, data_environment, training
  • Prestige system (1.0-10.0): Gates access to higher-paying tasks
  • Employee management: Junior/Mid/Senior with domain-specific skill rates
  • Throughput splitting: effective_rate = base_rate / N active tasks per employee
  • Financial pressure: Monthly payroll, bankruptcy = game over
  • Deterministic: SHA256-based RNG — same seed + preset = same world
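
Two of these mechanics are easy to picture in code. A sketch under stated assumptions (the function names and the seed-derivation scheme are hypothetical, not yc-bench's actual internals):

import hashlib

def effective_rate(base_rate: float, active_tasks: int) -> float:
    # Throughput splitting: an employee's rate divides evenly across
    # their active tasks.
    return base_rate / max(1, active_tasks)

def derive_seed(seed: int, preset: str) -> int:
    # Determinism: hashing (seed, preset) with SHA256 yields the same
    # RNG stream on every platform and Python build.
    digest = hashlib.sha256(f"{seed}:{preset}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

assert derive_seed(1, "medium") == derive_seed(1, "medium")  # same world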

Difficulty Presets

Preset      Employees  Tasks     Focus
tutorial    3          50        Basic loop mechanics
easy        5          100       Throughput awareness
medium      5          150       Prestige climbing + domain specialisation
hard        7          200       Precise ETA reasoning
nightmare   8          300       Sustained perfection under payroll pressure
fast_test   (varies)   (varies)  Quick validation (~50 turns)

Default eval runs fast_test + medium + hard × 3 seeds = 9 runs.

Scoring

composite = 0.5 × survival + 0.5 × normalised_funds
  • Survival (binary): Did the company avoid bankruptcy?
  • Normalised funds (0.0-1.0): Log-scale relative to initial $250K capital
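
A sketch of the composite in code, assuming a clamped log-scale normalisation (the exact constants yc-bench uses are not documented here):

import math

def normalised_funds(final_funds: float, initial: float = 250_000.0) -> float:
    # Hypothetical mapping of final funds into [0.0, 1.0] on a log scale,
    # centred so that ending at exactly the initial capital scores 0.5.
    if final_funds <= 0:
        return 0.0
    return min(1.0, max(0.0, 0.5 + 0.5 * math.log10(final_funds / initial)))

def composite(survived: bool, final_funds: float) -> float:
    return 0.5 * float(survived) + 0.5 * normalised_funds(final_funds)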

Configuration

Key fields in default.yaml:

Field            Default                          Description
presets          ["fast_test", "medium", "hard"]  Which presets to evaluate
seeds            [1, 2, 3]                        RNG seeds per preset
max_agent_turns  200                              Max LLM calls per run
run_timeout      3600                             Wall-clock timeout per run (seconds)
survival_weight  0.5                              Weight of survival in composite score
funds_weight     0.5                              Weight of normalised funds in composite
horizon_years    null                             Override horizon (null = auto from preset)
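
For illustration, deriving the default run count from these fields (the flat YAML layout assumed below may not match default.yaml's actual nesting):

import yaml  # pip install pyyaml

with open("environments/benchmarks/yc_bench/default.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# 3 presets x 3 seeds = 9 runs with the defaults above.
runs = [(p, s) for p in cfg["presets"] for s in cfg["seeds"]]
print(f"{len(runs)} runs, up to {cfg['max_agent_turns']} turns each")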

Cost & Time Estimates

Each run is 100-500 LLM turns. Approximate costs per run at typical API rates:

Preset     Turns  Time       Est. Cost
fast_test  ~50    5-10 min   $1-5
medium     ~200   20-40 min  $5-15
hard       ~300   30-60 min  $10-25

Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.

References