Closes the last Python-on-Windows UTF-8 exposure by making every
text-mode open() call explicit about its encoding.
Before: on Windows, bare open(path, 'r') defaults to the system
locale encoding (cp1252 on US-locale installs). That means reading
any config/yaml/markdown/json file with non-ASCII content either
crashes with UnicodeDecodeError or silently mis-decodes bytes.
After: all 89 affected call sites in production code now pass
encoding='utf-8' explicitly. Works identically on every platform
and every locale, no surprise behavior.
Mechanical sweep via:
ruff check --preview --extend-select PLW1514 --unsafe-fixes --fix --exclude 'tests,venv,.venv,node_modules,website,optional-skills, skills,tinker-atropos,plugins' .
All 89 fixes have the same shape: open(x) or open(x, mode) became
open(x, encoding='utf-8') or open(x, mode, encoding='utf-8'). Nothing
else changed. Every modified file still parses and the Windows/sandbox
test suite is still green (85 passed, 14 skipped, 0 failed across
tests/tools/test_code_execution_windows_env.py +
tests/tools/test_code_execution_modes.py + tests/tools/test_env_passthrough.py +
tests/test_hermes_bootstrap.py).
Scope notes:
- tests/ excluded: test fixtures can use locale encoding intentionally
(exercising edge cases). If we want to tighten tests later that's
a separate PR.
- plugins/ excluded: plugin-specific conventions may differ; plugin
authors own their code.
- optional-skills/ and skills/ excluded: skill scripts are user-authored
and we don't want to mass-edit them.
- website/ and tinker-atropos/ excluded: vendored / generated content.
46 files touched, 89 +/- lines (symmetric replacement). No behavior
change on POSIX or on Windows when the file is ASCII; bug fix on
Windows when the file contains non-ASCII.
|
||
|---|---|---|
| .. | ||
| __init__.py | ||
| default.yaml | ||
| README.md | ||
| run_eval.sh | ||
| yc_bench_env.py | ||
YC-Bench: Long-Horizon Agent Benchmark
YC-Bench by Collinear AI is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialisation across 4 skill domains.
Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns.
Setup
# Install yc-bench (optional dependency)
pip install "hermes-agent[yc-bench]"
# Or install from source
git clone https://github.com/collinear-ai/yc-bench
cd yc-bench && pip install -e .
# Verify
yc-bench --help
Running
# From the repo root:
bash environments/benchmarks/yc_bench/run_eval.sh
# Or directly:
python environments/benchmarks/yc_bench/yc_bench_env.py evaluate \
--config environments/benchmarks/yc_bench/default.yaml
# Override model:
bash environments/benchmarks/yc_bench/run_eval.sh \
--openai.model_name anthropic/claude-opus-4-20250514
# Quick single-preset test:
bash environments/benchmarks/yc_bench/run_eval.sh \
--env.presets '["fast_test"]' --env.seeds '[1]'
How It Works
Architecture
HermesAgentLoop (our agent)
-> terminal tool -> subprocess("yc-bench company status") -> JSON output
-> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON
-> terminal tool -> subprocess("yc-bench sim resume") -> JSON (advance time)
-> ... (100-500 turns per run)
The environment initialises the simulation via yc-bench sim init (NOT yc-bench run, which would start yc-bench's own built-in agent loop). Our HermesAgentLoop then drives all interaction through CLI commands.
Simulation Mechanics
- 4 skill domains: research, inference, data_environment, training
- Prestige system (1.0-10.0): Gates access to higher-paying tasks
- Employee management: Junior/Mid/Senior with domain-specific skill rates
- Throughput splitting:
effective_rate = base_rate / Nactive tasks per employee - Financial pressure: Monthly payroll, bankruptcy = game over
- Deterministic: SHA256-based RNG — same seed + preset = same world
Difficulty Presets
| Preset | Employees | Tasks | Focus |
|---|---|---|---|
| tutorial | 3 | 50 | Basic loop mechanics |
| easy | 5 | 100 | Throughput awareness |
| medium | 5 | 150 | Prestige climbing + domain specialisation |
| hard | 7 | 200 | Precise ETA reasoning |
| nightmare | 8 | 300 | Sustained perfection under payroll pressure |
| fast_test | (varies) | (varies) | Quick validation (~50 turns) |
Default eval runs fast_test + medium + hard × 3 seeds = 9 runs.
Scoring
composite = 0.5 × survival + 0.5 × normalised_funds
- Survival (binary): Did the company avoid bankruptcy?
- Normalised funds (0.0-1.0): Log-scale relative to initial $250K capital
Configuration
Key fields in default.yaml:
| Field | Default | Description |
|---|---|---|
presets |
["fast_test", "medium", "hard"] |
Which presets to evaluate |
seeds |
[1, 2, 3] |
RNG seeds per preset |
max_agent_turns |
200 | Max LLM calls per run |
run_timeout |
3600 | Wall-clock timeout per run (seconds) |
survival_weight |
0.5 | Weight of survival in composite score |
funds_weight |
0.5 | Weight of normalised funds in composite |
horizon_years |
null | Override horizon (null = auto from preset) |
Cost & Time Estimates
Each run is 100-500 LLM turns. Approximate costs per run at typical API rates:
| Preset | Turns | Time | Est. Cost |
|---|---|---|---|
| fast_test | ~50 | 5-10 min | $1-5 |
| medium | ~200 | 20-40 min | $5-15 |
| hard | ~300 | 30-60 min | $10-25 |
Full default eval (9 runs): ~3-6 hours, $50-200 depending on model.
References
- collinear-ai/yc-bench — Official repository
- Collinear AI — Company behind yc-bench
- TerminalBench2 — Per-task coding benchmark (complementary)