hermes-agent/environments/benchmarks/yc_bench
Teknium cbce5e93fc codebase: add encoding='utf-8' to all bare open() calls (PLW1514)
Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.

Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs).  That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.

After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly.  Works identically on every platform
and every locale, no surprise behavior.
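
An illustrative before/after (the path here is hypothetical):

  # Before: decoded with the platform locale (cp1252 on US-locale
  # Windows); non-ASCII bytes crash or silently mis-decode.
  with open("config.yaml") as f:
      data = f.read()

  # After: explicit encoding, identical on every platform and locale.
  with open("config.yaml", encoding="utf-8") as f:
      data = f.read()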

Mechanical sweep via:
  ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix \
    --exclude 'tests,venv,.venv,node_modules,website,optional-skills,skills,tinker-atropos,plugins' .

All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8').  Nothing
else changed.  Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).

Scope notes:
  - tests/ excluded: test fixtures can use locale encoding intentionally
    (exercising edge cases).  If we want to tighten tests later that's
    a separate PR.
  - plugins/ excluded: plugin-specific conventions may differ; plugin
    authors own their code.
  - optional-skills/ and skills/ excluded: skill scripts are user-authored
    and we don't want to mass-edit them.
  - website/ and tinker-atropos/ excluded: vendored / generated content.

46 files touched, +89/-89 lines (symmetric replacement).  No behavior
change on POSIX, or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII content.
2026-05-08 14:27:40 -07:00
__init__.py feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
default.yaml fix: update OpenRouter model names for yc-bench config 2026-03-06 19:58:56 -08:00
README.md feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
run_eval.sh feat: add YC-Bench long-horizon agent benchmark environment 2026-03-06 19:25:56 -08:00
yc_bench_env.py codebase: add encoding='utf-8' to all bare open() calls (PLW1514) 2026-05-08 14:27:40 -07:00

YC-Bench: Long-Horizon Agent Benchmark

YC-Bench by Collinear AI is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.

Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.

Setup

# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"

# Or install from source
git clone https://github.com/collinear-ai/yc-bench
cd yc-bench && pip install -e .

# Verify
yc-bench --help

Running

# From the repo root:
bash environments/benchmarks/yc_bench/run_eval.sh

# Or directly:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
    --config environments/benchmarks/yc_bench/default.yaml

# Override model:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --openai.model_name anthropic/claude-opus-4-20250514

# Quick single-preset test:
bash environments/benchmarks/yc_bench/run_eval.sh \
    --env.presets '["fast_test"]' --env.seeds '[1]'

How It Works

Architecture

HermesAgentLoop (our agent)
  -> terminal tool -> subprocess("yc-bench company status") -> JSON output
  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
  -> ... (100-500 turns per run)

The environment initialises the simulation via yc-bench sim init (NOT yc-bench run, which would start yc-bench's own built-in agent loop). Our HermesAgentLoop then drives all interaction through CLI commands.
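
A minimal sketch of that pattern, assuming a hypothetical run_yc helper (illustrative, not the actual yc_bench_env.py code):

import json
import subprocess

def run_yc(*args: str) -> dict:
    # Invoke the yc-bench CLI and parse the JSON it prints.
    result = subprocess.run(
        ["yc-bench", *args],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

status = run_yc("company", "status")        # inspect the simulated company
run_yc("task", "accept", "--task-id", "X")  # accept a task
run_yc("sim", "resume")                     # advance simulated time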

Simulation Mechanics

  • 4 skill domains: research, inference, data_environment, training
  • Prestige system (1.0-10.0): Gates access to higher-paying tasks
  • Employee management: Junior/Mid/Senior with domain-specific skill rates
  • Throughput splitting: effective_rate = base_rate / N active tasks per employee
  • Financial pressure: Monthly payroll, bankruptcy = game over
  • Deterministic: SHA256-based RNG — same seed + preset = same world
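
Two of these mechanics are easy to picture in code. A sketch under stated assumptions (the function names and the seed-derivation scheme are hypothetical, not yc-bench's actual internals):

import hashlib

def effective_rate(base_rate: float, active_tasks: int) -> float:
    # Throughput splitting: an employee's rate divides evenly across
    # their active tasks.
    return base_rate / max(1, active_tasks)

def derive_seed(seed: int, preset: str) -> int:
    # Determinism: hashing (seed, preset) with SHA256 yields the same
    # RNG stream on every platform and Python build.
    digest = hashlib.sha256(f"{seed}:{preset}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

assert derive_seed(1, "medium") == derive_seed(1, "medium")  # same world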

Difficulty Presets

Preset      Employees  Tasks     Focus
tutorial    3          50        Basic loop mechanics
easy        5          100       Throughput awareness
medium      5          150       Prestige climbing + domain specialisation
hard        7          200       Precise ETA reasoning
nightmare   8          300       Sustained perfection under payroll pressure
fast_test   (varies)   (varies)  Quick validation (~50 turns)

Default eval runs fast_test + medium + hard × 3 seeds = 9 runs.

Scoring

composite = 0.5 × survival + 0.5 × normalised_funds
  • Survival (binary): Did the company avoid bankruptcy?
  • Normalised funds (0.0-1.0): Log-scale relative to initial $250K capital
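
A sketch of the composite in code, assuming a clamped log-scale normalisation (the exact constants yc-bench uses are not documented here):

import math

def normalised_funds(final_funds: float, initial: float = 250_000.0) -> float:
    # Hypothetical mapping of final funds into [0.0, 1.0] on a log scale,
    # centred so that ending at exactly the initial capital scores 0.5.
    if final_funds <= 0:
        return 0.0
    return min(1.0, max(0.0, 0.5 + 0.5 * math.log10(final_funds / initial)))

def composite(survived: bool, final_funds: float) -> float:
    return 0.5 * float(survived) + 0.5 * normalised_funds(final_funds)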

Configuration

Key fields in default.yaml:

Field            Default                          Description
presets          ["fast_test", "medium", "hard"]  Which presets to evaluate
seeds            [1, 2, 3]                        RNG seeds per preset
max_agent_turns  200                              Max LLM calls per run
run_timeout      3600                             Wall-clock timeout per run (seconds)
survival_weight  0.5                              Weight of survival in composite score
funds_weight     0.5                              Weight of normalised funds in composite
horizon_years    null                             Override horizon (null = auto from preset)
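
For illustration, deriving the default run count from these fields (the flat YAML layout assumed below may not match default.yaml's actual nesting):

import yaml  # pip install pyyaml

with open("environments/benchmarks/yc_bench/default.yaml", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# 3 presets x 3 seeds = 9 runs with the defaults above.
runs = [(p, s) for p in cfg["presets"] for s in cfg["seeds"]]
print(f"{len(runs)} runs, up to {cfg['max_agent_turns']} turns each")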

Cost & Time Estimates

Each run is 100-500 LLM turns. Approximate costs per run at typical API rates:

Preset     Turns  Time       Est. Cost
fast_test  ~50    5-10 min   $1-5
medium     ~200   20-40 min  $5-15
hard       ~300   30-60 min  $10-25

Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.

References