mirror of
https://github.com/NousResearch/hermes-agent.git
synced 2026-05-18 04:41:56 +00:00
feat(skill): darwinian-evolver optional skill
Thin wrapper around Imbue's darwinian_evolver (AGPL-3.0, subprocess-only). Ships a working OpenRouter driver (parrot_openrouter.py), a snapshot inspector (show_snapshot.py), and a custom-problem template.

SKILL.md has a 58-char description and Pitfalls sourced from actually running the loop: non-viable seed trap, Azure content filter killing runs, loop.run() being a generator, nested-pickle snapshots, and aggressive default concurrency.

Salvaged from #12719 by @Bihruze. The original PR shipped 12,289 LOC across 61 files (29 Python modules, FastAPI dashboard, VS Code extension, benchmark hub, marketplace, etc.), which was far beyond the scope of the underlying issue (#336). This version stays at the ~700-LOC scope that the issue actually asked for. Authorship of the original effort is credited via an AUTHOR_MAP entry and the SKILL.md author field.

Verified end-to-end: seed 'Say {{ phrase }}' (score 0.000) evolved into 'Please repeat the following phrase exactly as it is, without any modifications or additional formatting: {{ phrase }}' (score 0.750) across 3 iterations on gpt-4o-mini via OpenRouter.

Co-authored-by: Bihruze <98262967+Bihruze@users.noreply.github.com>
parent e377833fa6
commit c9b32a654c
5 changed files with 828 additions and 0 deletions
199
optional-skills/research/darwinian-evolver/SKILL.md
Normal file
@@ -0,0 +1,199 @@
---
name: darwinian-evolver
description: Evolve prompts/regex/SQL/code with Imbue's evolution loop.
version: 0.1.0
author: Bihruze (Asahi0x), Hermes Agent
license: MIT
platforms: [linux, macos]
metadata:
  hermes:
    tags: [evolution, optimization, prompt-engineering, research]
    related_skills: [arxiv, jupyter-live-kernel]
---

# Darwinian Evolver

Run Imbue's [darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver), an
LLM-driven evolutionary search loop, to optimize a **prompt, regex, SQL query,
or small code snippet** against a fitness function.

Status: thin wrapper around the upstream tool. The skill installs it, walks the
agent through writing a `Problem` definition (organism + evaluator + mutator),
and drives the loop via the upstream CLI or a small custom Python driver.

**License:** the upstream tool is **AGPL-3.0**. The skill ONLY ever invokes it
via the upstream CLI or a `subprocess`/`uv run` call (mere aggregation). Do NOT
import upstream classes into Hermes itself.
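
The subprocess-only rule can be sketched as follows. This is a minimal illustration, not Hermes code: `build_parrot_command` and `run_parrot` are hypothetical helper names, and the flags and paths are the ones this document uses in its install and quick-start commands.

```python
import subprocess
from pathlib import Path

# Checkout location used by the install step in this document.
DE_DIR = Path.home() / ".hermes/cache/darwinian-evolver/darwinian_evolver"


def build_parrot_command(output_dir: str, num_iterations: int = 2) -> list[str]:
    """Build the CLI invocation; nothing from upstream is imported."""
    return [
        "uv", "run", "darwinian_evolver", "parrot",
        "--num_iterations", str(num_iterations),
        "--num_parents_per_iteration", "2",
        "--mutator_concurrency", "2",
        "--evaluator_concurrency", "2",
        "--output_dir", output_dir,
    ]


def run_parrot(output_dir: str) -> int:
    """Run the upstream CLI in its own process and return its exit code."""
    completed = subprocess.run(
        build_parrot_command(output_dir), cwd=DE_DIR, check=False
    )
    return completed.returncode
```

Keeping the command construction separate from the call makes the invocation easy to log or test without spending any LLM calls.
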

## When to Use

- User says "optimize this prompt", "evolve a regex for X", "auto-improve this
  code/SQL", "search for a better instruction".
- You have a scorer (exact match, regex pass-rate, unit test, LLM-judge, runtime
  metric) AND a starting candidate (organism). If you don't have a scorer, stop
  and define one first; that's the hard part.
- Cost is OK: a typical run is 50–500 LLM calls. On gpt-4o-mini that's pennies;
  on Claude Sonnet it can be a few dollars.

Do **not** use this when:
- The optimization target is differentiable (use gradient descent / DSPy).
- You only need to try 2–3 variants; just write them by hand.
- The fitness signal is purely subjective with no measurable criterion.

## Prerequisites

- Python ≥3.11
- `git`, `uv` (or `pip`)
- One of: `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`

The skill ships a small `parrot_openrouter.py` driver that uses `OPENROUTER_API_KEY`
via the OpenAI SDK, so any model on OpenRouter works. The upstream CLI itself
hardcodes Anthropic and needs `ANTHROPIC_API_KEY`.

## Install (One-Time)

Run via the `terminal` tool:

```bash
mkdir -p ~/.hermes/cache/darwinian-evolver && cd ~/.hermes/cache/darwinian-evolver
[ -d darwinian_evolver ] || git clone --depth 1 https://github.com/imbue-ai/darwinian_evolver.git
cd darwinian_evolver && uv sync
```

Verify:

```bash
cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver \
  && uv run darwinian_evolver --help | head -5
```

## Quick Start: The Built-In Parrot Example

Tiny smoke test (requires `ANTHROPIC_API_KEY`):

```bash
cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver
uv run darwinian_evolver parrot \
  --num_iterations 2 \
  --num_parents_per_iteration 2 \
  --mutator_concurrency 2 --evaluator_concurrency 2 \
  --output_dir /tmp/parrot_demo
```

Outputs:
- `/tmp/parrot_demo/snapshots/iteration_N.pkl`: pickled population per iteration
- `/tmp/parrot_demo/<jsonl>`: per-iteration JSON log (path printed at end)

Open `~/.hermes/cache/darwinian-evolver/darwinian_evolver/darwinian_evolver/lineage_visualizer.html`
in a browser and load the JSON log to see the evolutionary tree.

## Quick Start: OpenRouter Driver (No Anthropic Key)

The skill ships `scripts/parrot_openrouter.py`: same parrot problem, but the
LLM call goes through OpenRouter so any provider works.

```bash
# From wherever the skill is installed:
SKILL_DIR=~/.hermes/skills/research/darwinian-evolver
DE_DIR=~/.hermes/cache/darwinian-evolver/darwinian_evolver

cd "$DE_DIR" && \
  EVOLVER_MODEL='openai/gpt-4o-mini' \
  uv run --with openai python "$SKILL_DIR/scripts/parrot_openrouter.py" \
    --num_iterations 3 --num_parents_per_iteration 2 \
    --output_dir /tmp/parrot_or
```

Inspect the result with `scripts/show_snapshot.py`:

```bash
uv run --with openai python "$SKILL_DIR/scripts/show_snapshot.py" \
  /tmp/parrot_or/snapshots/iteration_3.pkl
```

Expected output: 7 evolved prompt templates ranked by score, with the best
landing around 0.6–0.8 (the seed `Say {{ phrase }}` scored 0.000).

## Defining a Custom Problem

The skill ships `templates/custom_problem_template.py`: copy, edit, run.
Three things you must define:

1. **`Organism`**: a Pydantic `BaseModel` subclass holding the artifact being
   evolved (`prompt_template: str`, `regex_pattern: str`, `sql_query: str`,
   `code_block: str`, etc.). Add a `run(*args)` method that exercises it.

2. **`Evaluator`**: `.evaluate(organism) -> EvaluationResult(score=..., trainable_failure_cases=[...], holdout_failure_cases=[...], is_viable=True)`.
   - **`score`** is in `[0, 1]`. Higher is better.
   - **`trainable_failure_cases`**: what the mutator sees. Include enough
     context (input, expected, actual) for the LLM to diagnose.
   - **`holdout_failure_cases`**: kept out of the mutator's view. Use these
     to detect overfitting.
   - **`is_viable=True`** unless the organism is completely broken (raises,
     returns None, etc.). A 0-score viable organism is fine; it just gets
     down-weighted in parent selection.

3. **`Mutator`**: `.mutate(organism, failure_cases, learning_log_entries) -> list[Organism]`.
   Typically: build an LLM prompt that includes the current organism + a
   failure case + an ask to propose a fix; parse the LLM's response; return
   a one-element list holding the new `Organism`. Return `[]` on parse
   failure; the loop handles it.
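
As an illustration of the evaluator contract above, here is a minimal sketch for a regex artifact. It deliberately avoids the upstream types: `FailureCase`, `Result`, `evaluate_regex`, and the tiny date dataset are hypothetical stand-ins that only mirror the four fields listed for `EvaluationResult`.

```python
import re
from dataclasses import dataclass, field


@dataclass
class FailureCase:
    """Hypothetical stand-in for a failure-case record."""
    text: str
    expected: bool
    actual: bool


@dataclass
class Result:
    """Hypothetical stand-in mirroring the four fields listed above."""
    score: float
    trainable_failure_cases: list = field(default_factory=list)
    holdout_failure_cases: list = field(default_factory=list)
    is_viable: bool = True


# (text, should_match) pairs; the mutator would only ever see TRAINABLE.
TRAINABLE = [("2024-01-31", True), ("31/01/2024", False), ("2024-1-3", False)]
HOLDOUT = [("1999-12-01", True), ("not a date", False)]


def evaluate_regex(pattern: str) -> Result:
    try:
        compiled = re.compile(pattern)
    except re.error:
        # A pattern that does not compile is completely broken: not viable.
        return Result(score=0.0, is_viable=False)

    def failures(cases):
        out = []
        for text, expected in cases:
            actual = compiled.fullmatch(text) is not None
            if actual != expected:
                out.append(FailureCase(text, expected, actual))
        return out

    train_fails, hold_fails = failures(TRAINABLE), failures(HOLDOUT)
    n_total = len(TRAINABLE) + len(HOLDOUT)
    n_ok = n_total - len(train_fails) - len(hold_fails)
    return Result(n_ok / n_total, train_fails, hold_fails, True)
```

A pattern that matches too much (for example `.*`) stays viable but scores low, while a pattern that fails to compile is marked non-viable, which is the distinction the `is_viable` bullet draws.
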

Then write a driver script that wires `Problem(initial_organism, evaluator, [mutators])`
into `EvolveProblemLoop` and iterates over `loop.run(num_iterations=N)`; the
shipped `scripts/parrot_openrouter.py` is the reference.
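
The "iterates over" wording matters because `loop.run()` is a generator (see Pitfalls). The sketch below shows the general Python behaviour with a hypothetical stand-in named `run`; it is not the upstream implementation.

```python
from typing import Iterator

calls = []  # records which iterations actually executed


def run(num_iterations: int) -> Iterator[int]:
    """Hypothetical stand-in for a generator-style `loop.run()`."""
    for i in range(num_iterations):
        calls.append(i)  # the expensive mutate/evaluate work would happen here
        yield i


pending = run(num_iterations=3)  # creates the generator; no work has run yet
assert calls == []

snapshots = list(pending)  # iterating is what drives the loop
assert calls == [0, 1, 2]
```
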

## Hyperparameters That Actually Matter

| flag | default | when to change |
|---|---|---|
| `--num_iterations` | 5 | bump to 10–20 once you trust the evaluator |
| `--num_parents_per_iteration` | 4 | drop to 2 for cheap exploration |
| `--mutator_concurrency` | 10 | drop to 2–4 to avoid rate limits |
| `--evaluator_concurrency` | 10 | same; evaluator hits the LLM too |
| `--batch_size` | 1 | raise to 3–5 once your mutator handles multiple failures |
| `--verify_mutations` | off | turn on once the mutator is wasteful (>10× cost saving on later runs per Imbue) |
| `--midpoint_score` | `p75` | leave alone unless scores cluster |
| `--sharpness` | 10 | leave alone |
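
To build intuition for `--midpoint_score` and `--sharpness`, the sketch below assumes a sigmoid-shaped parent weighting with a novelty discount, in the style of the Darwin Gödel Machine paper listed in the References. This is an assumption for illustration only: the function name `parent_weight`, its parameters, and the formula are not taken from the upstream source.

```python
import math


def parent_weight(score: float, midpoint: float,
                  sharpness: float = 10.0, num_children: int = 0) -> float:
    """Assumed weighting: sigmoid around the midpoint times a novelty discount."""
    sigmoid = 1.0 / (1.0 + math.exp(-sharpness * (score - midpoint)))
    novelty = 1.0 / (1.0 + num_children)  # parents already used often count less
    return sigmoid * novelty


# With a midpoint of 0.5, compare a weak, an average, and a strong organism.
for s in (0.0, 0.5, 0.75):
    print(s, round(parent_weight(s, midpoint=0.5), 4))
```

Under this assumption, a higher sharpness makes selection greedier around the midpoint, and a 0-score organism keeps a small but non-zero chance of being picked, which matches the "down-weighted, not excluded" behaviour described for viable organisms.
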

## Pitfalls

1. **`Initial organism must be viable`**: set `is_viable=True` in your
   `EvaluationResult` even on a 0-score seed. The loop refuses a non-viable
   seed because it would have nothing to evolve from.
2. **Provider content filters kill runs.** Azure-backed OpenRouter models
   reject phrases like "ignore previous instructions" with HTTP 400. Wrap
   the LLM call in `try/except` and return `f"<LLM_ERROR: {e}>"`; the
   response then counts as a failure, the organism scores low, and the run
   moves on.
3. **`loop.run()` is a generator.** Calling it doesn't run anything until
   you iterate. Use `for snap in loop.run(num_iterations=N):`.
4. **Snapshots are nested pickles.** `iteration_N.pkl` contains a dict with
   `population_snapshot` (more pickled bytes). To unpickle you must have the
   `Organism` class importable under the same dotted path it was pickled at.
5. **Concurrency defaults are aggressive.** 10/10 will hit rate limits on
   most providers. Start with 2/2.
6. **CLI is hardcoded to Anthropic.** `uv run darwinian_evolver <problem>`
   reaches for `ANTHROPIC_API_KEY` and uses Claude Sonnet. To use any other
   provider, write a driver like `parrot_openrouter.py`.
7. **AGPL.** Never `from darwinian_evolver import ...` inside Hermes core.
   Custom driver scripts under `~/.hermes/skills/...` are user-side and fine.
8. **No PyPI package.** `pip install darwinian-evolver` will pull the wrong
   thing. Always install from the GitHub repo.
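
Pitfall 4 can be sketched with the standard library alone. The key names `population_snapshot` and `organisms` are the ones `show_snapshot.py` reads; the payload here is a plain tuple chosen for illustration, whereas real snapshots hold `Organism` objects whose class must be importable when the inner pickle is loaded.

```python
import pickle

# Write side: the population is pickled first, then embedded as raw bytes
# inside an outer dict, which is pickled again.
inner_bytes = pickle.dumps({"organisms": [("Say {{ phrase }}", 0.0)]})
outer_bytes = pickle.dumps({"population_snapshot": inner_bytes})

# Read side: one pickle.loads call is not enough.
outer = pickle.loads(outer_bytes)
assert isinstance(outer["population_snapshot"], bytes)  # still opaque bytes

inner = pickle.loads(outer["population_snapshot"])
template, score = inner["organisms"][0]
print(template, score)
```

Only unpickle snapshots you produced yourself, since `pickle.loads` can execute arbitrary code from an untrusted file.
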

## Verification

After installing and completing a parrot run, exit code 0 from the following
check is sufficient:

```bash
DE_DIR=~/.hermes/cache/darwinian-evolver/darwinian_evolver
ls "$DE_DIR/darwinian_evolver/lineage_visualizer.html" >/dev/null && \
  cd "$DE_DIR" && uv run darwinian_evolver --help >/dev/null && \
  echo "darwinian-evolver: OK"
```

## References

- [Imbue research post](https://imbue.com/research/2026-02-27-darwinian-evolver/)
- [ARC-AGI-2 results](https://imbue.com/research/2026-02-27-arc-agi-2-evolution/)
- [imbue-ai/darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver) (AGPL-3.0)
- [Darwin Gödel Machines](https://arxiv.org/abs/2505.22954)
- [PromptBreeder](https://arxiv.org/abs/2309.16797)

@@ -0,0 +1,218 @@
"""
parrot_openrouter: same as the upstream `parrot` example but the LLM call goes
through OpenRouter (OpenAI SDK) instead of Anthropic native. Lets us run an
end-to-end evolution with whatever model the user already has paid access to.

Run with:
    uv --project darwinian_evolver run python parrot_openrouter.py \
        --num_iterations 3 --output_dir /tmp/parrot_out

Reads `OPENROUTER_API_KEY` from the environment.
"""
from __future__ import annotations

import argparse
import os
import sys
from pathlib import Path

import jinja2
from openai import OpenAI

# Problem types imported from upstream (AGPL: only run via subprocess in production)
from darwinian_evolver.cli_common import build_hyperparameter_config_from_args
from darwinian_evolver.cli_common import register_hyperparameter_args
from darwinian_evolver.cli_common import parse_learning_log_view_type
from darwinian_evolver.evolve_problem_loop import EvolveProblemLoop
from darwinian_evolver.learning_log import LearningLogEntry
from darwinian_evolver.problem import EvaluationFailureCase
from darwinian_evolver.problem import EvaluationResult
from darwinian_evolver.problem import Evaluator
from darwinian_evolver.problem import Mutator
from darwinian_evolver.problem import Organism
from darwinian_evolver.problem import Problem

DEFAULT_MODEL = os.environ.get("EVOLVER_MODEL", "openai/gpt-4o-mini")


def _client() -> OpenAI:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        sys.exit("OPENROUTER_API_KEY is not set")
    return OpenAI(api_key=key, base_url="https://openrouter.ai/api/v1")


def _prompt_llm(prompt: str) -> str:
    try:
        r = _client().chat.completions.create(
            model=DEFAULT_MODEL,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content or ""
    except Exception as e:
        # Treat any provider error (rate limit, content filter, schema reject)
        # as a failed response. The evolver will simply see this as a low score
        # on this organism and move on, which is much friendlier than killing
        # the run.
        return f"<LLM_ERROR: {type(e).__name__}: {e}>"


class ParrotOrganism(Organism):
    prompt_template: str

    def run(self, phrase: str) -> str:
        try:
            prompt = jinja2.Template(self.prompt_template).render(phrase=phrase)
        except jinja2.exceptions.TemplateError as e:
            return f"Error rendering prompt: {e}"
        if not prompt:
            return ""
        return _prompt_llm(prompt)


class ParrotEvaluationFailureCase(EvaluationFailureCase):
    phrase: str
    response: str


class ImproveParrotMutator(Mutator[ParrotOrganism, ParrotEvaluationFailureCase]):
    IMPROVEMENT_PROMPT_TEMPLATE = """
We want to build a prompt that causes an LLM to repeat back a given phrase verbatim.

The current prompt template is:
```
{{ organism.prompt_template }}
```

Unfortunately, on this phrase:
```
{{ failure_case.phrase }}
```
the LLM responded with:
```
{{ failure_case.response }}
```

Diagnose what went wrong, then propose an improved prompt template. Put the new
template in the LAST triple-backtick block of your response.
""".strip()

    def mutate(
        self,
        organism: ParrotOrganism,
        failure_cases: list[ParrotEvaluationFailureCase],
        learning_log_entries: list[LearningLogEntry],
    ) -> list[ParrotOrganism]:
        if not failure_cases:
            # Nothing to learn from, so propose no mutation.
            return []
        fc = failure_cases[0]
        prompt = jinja2.Template(self.IMPROVEMENT_PROMPT_TEMPLATE).render(
            organism=organism, failure_case=fc
        )
        try:
            resp = _prompt_llm(prompt)
            # The proposal is the content of the last fenced block.
            parts = resp.split("```")
            if len(parts) < 3:
                return []
            new_tpl = parts[-2].strip()
            return [ParrotOrganism(prompt_template=new_tpl)]
        except Exception as e:
            print(f"mutate error: {e}", file=sys.stderr)
            return []


class ParrotEvaluator(Evaluator[ParrotOrganism, EvaluationResult, ParrotEvaluationFailureCase]):
    TRAINABLE_PHRASES = [
        "Hello world.",
        "bla",
        "Bla",
        "bla.",
        '"bla bla".',
        "Just say 'foo' once with no extra words.",
    ]
    HOLDOUT_PHRASES = [
        "bla, but only once.",
        "'bla'",
    ]

    def evaluate(self, organism: ParrotOrganism) -> EvaluationResult:
        train_fails: list[ParrotEvaluationFailureCase] = []
        hold_fails: list[ParrotEvaluationFailureCase] = []
        for i, p in enumerate(self.TRAINABLE_PHRASES):
            r = organism.run(p)
            if r != p:
                train_fails.append(ParrotEvaluationFailureCase(
                    phrase=p, response=r, data_point_id=f"trainable_{i}"))
        for i, p in enumerate(self.HOLDOUT_PHRASES):
            r = organism.run(p)
            if r != p:
                hold_fails.append(ParrotEvaluationFailureCase(
                    phrase=p, response=r, data_point_id=f"holdout_{i}"))
        n_total = len(self.TRAINABLE_PHRASES) + len(self.HOLDOUT_PHRASES)
        n_ok = n_total - len(train_fails) - len(hold_fails)
        return EvaluationResult(
            score=n_ok / n_total,
            trainable_failure_cases=train_fails,
            holdout_failure_cases=hold_fails,
            # Always viable. Even a 0-score seed is a valid starting point; the
            # mutator should still get a chance to fix it.
            is_viable=True,
        )


def make_problem() -> Problem:
    return Problem[ParrotOrganism, EvaluationResult, ParrotEvaluationFailureCase](
        evaluator=ParrotEvaluator(),
        mutators=[ImproveParrotMutator()],
        initial_organism=ParrotOrganism(prompt_template="Say {{ phrase }}"),
    )


def main() -> int:
    ap = argparse.ArgumentParser()
    register_hyperparameter_args(ap.add_argument_group("hyperparameters"))
    ap.add_argument("--num_iterations", type=int, default=3)
    ap.add_argument("--mutator_concurrency", type=int, default=4)
    ap.add_argument("--evaluator_concurrency", type=int, default=4)
    ap.add_argument("--output_dir", type=str, required=True)
    args = ap.parse_args()

    out = Path(args.output_dir)
    out.mkdir(parents=True, exist_ok=True)

    hp = build_hyperparameter_config_from_args(args)
    loop = EvolveProblemLoop(
        problem=make_problem(),
        learning_log_view_type=parse_learning_log_view_type(hp.learning_log_view_type),
        num_parents_per_iteration=hp.num_parents_per_iteration,
        mutator_concurrency=args.mutator_concurrency,
        evaluator_concurrency=args.evaluator_concurrency,
        fixed_midpoint_score=hp.fixed_midpoint_score,
        midpoint_score_percentile=hp.midpoint_score_percentile,
        sharpness=hp.sharpness,
        novelty_weight=hp.novelty_weight,
        batch_size=hp.batch_size,
        should_verify_mutations=hp.verify_mutations,
    )

    import json
    log_path = out / "results.jsonl"
    snap_dir = out / "snapshots"
    snap_dir.mkdir(exist_ok=True)
    print("Evaluating initial organism...")
    # loop.run() is a generator: each iteration runs only as it is consumed.
    for snap in loop.run(num_iterations=args.num_iterations):
        (snap_dir / f"iteration_{snap.iteration}.pkl").write_bytes(snap.snapshot)
        _, best_eval = snap.best_organism_result
        print(f"iter={snap.iteration} pop={snap.population_size} "
              f"best_score={best_eval.score:.3f}")
        with log_path.open("a") as f:
            f.write(json.dumps({
                "iteration": snap.iteration,
                "best_score": best_eval.score,
                "pop_size": snap.population_size,
                "score_percentiles": {str(k): v for k, v in snap.score_percentiles.items()},
            }) + "\n")
    print(f"\nDone. Results in: {out}")
    return 0


if __name__ == "__main__":
    sys.exit(main())

@@ -0,0 +1,69 @@
"""
show_snapshot.py: dump the population from a darwinian-evolver snapshot pickle.

Usage:
    python show_snapshot.py PATH/TO/iteration_N.pkl [--field prompt_template]

The script is intentionally Organism-agnostic: it walks `vars(org)` and prints
the first public str field it finds (skipping `id`). Pass --field to target a
specific attribute (e.g. `prompt_template`, `regex_pattern`, `sql_query`,
`code_block`).
"""
from __future__ import annotations

import argparse
import pickle
import sys
from pathlib import Path


def main() -> int:
    ap = argparse.ArgumentParser()
    ap.add_argument("snapshot", type=Path)
    ap.add_argument(
        "--field",
        default=None,
        help="Organism attribute to display. Defaults to the first str field found.",
    )
    ap.add_argument("--top", type=int, default=None, help="Show only top N by score.")
    args = ap.parse_args()

    if not args.snapshot.exists():
        sys.exit(f"snapshot not found: {args.snapshot}")

    # The outer pickle wraps a dict; the inner pickle contains the actual organism
    # objects, which must be importable under their original dotted path. If you
    # ran a custom driver, make sure its module is on sys.path before calling this.
    outer = pickle.loads(args.snapshot.read_bytes())
    if not isinstance(outer, dict) or "population_snapshot" not in outer:
        sys.exit("not a darwinian-evolver snapshot (no population_snapshot key)")
    inner = pickle.loads(outer["population_snapshot"])
    pairs = inner["organisms"]  # list of (Organism, EvaluationResult)

    print(f"# organisms: {len(pairs)}\n")
    ranked = sorted(pairs, key=lambda p: getattr(p[1], "score", 0) or 0, reverse=True)
    if args.top:
        ranked = ranked[: args.top]

    for i, (org, res) in enumerate(ranked):
        score = getattr(res, "score", float("nan"))
        print(f"=== rank {i} score={score:.3f} ===")
        # Pick the field to show: --field if given, else the first public str field.
        field = args.field
        if field is None:
            for k, v in vars(org).items():
                if isinstance(v, str) and not k.startswith("_") and k not in ("id",):
                    field = k
                    break
        val = getattr(org, field, None) if field else None
        if val is None:
            print(f" (no string field; org fields: {list(vars(org).keys())})")
        else:
            print(f" {field} ({len(val)} chars):")
            for ln in val.splitlines()[:30]:
                print(f" {ln}")
        print()
    return 0


if __name__ == "__main__":
    sys.exit(main())

@@ -0,0 +1,240 @@
"""
Template: a custom darwinian-evolver problem.

Copy this file, fill in the THREE marked spots (Organism, Evaluator, Mutator),
then run it as a driver script. The skeleton handles all the wiring so you only
write the domain-specific logic.

To run:
    cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver
    OPENROUTER_API_KEY=... uv run --with openai python /path/to/this_file.py \
        --num_iterations 3 --num_parents_per_iteration 2 \
        --output_dir /tmp/my_problem

The pattern mirrors `scripts/parrot_openrouter.py` (the working reference).
"""
from __future__ import annotations

import argparse
import os
import sys
from pathlib import Path

from openai import OpenAI

# Upstream types (AGPL: invoked via subprocess in production; importing here
# is fine for skill-side driver scripts the user owns).
from darwinian_evolver.cli_common import (
    build_hyperparameter_config_from_args,
    parse_learning_log_view_type,
    register_hyperparameter_args,
)
from darwinian_evolver.evolve_problem_loop import EvolveProblemLoop
from darwinian_evolver.learning_log import LearningLogEntry
from darwinian_evolver.problem import (
    EvaluationFailureCase,
    EvaluationResult,
    Evaluator,
    Mutator,
    Organism,
    Problem,
)

DEFAULT_MODEL = os.environ.get("EVOLVER_MODEL", "openai/gpt-4o-mini")


def _client() -> OpenAI:
    key = os.environ.get("OPENROUTER_API_KEY")
    if not key:
        sys.exit("OPENROUTER_API_KEY is not set")
    return OpenAI(api_key=key, base_url="https://openrouter.ai/api/v1")


def _prompt_llm(prompt: str, max_tokens: int = 1024) -> str:
    try:
        r = _client().chat.completions.create(
            model=DEFAULT_MODEL,
            max_tokens=max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content or ""
    except Exception as e:
        # Never let one bad LLM response kill the run.
        return f"<LLM_ERROR: {type(e).__name__}: {e}>"


# ---------------------------------------------------------------------------
# 1. ORGANISM: what you are evolving.
# ---------------------------------------------------------------------------
class MyOrganism(Organism):
    # TODO: replace with your artifact field. Common shapes:
    #   prompt_template: str
    #   regex_pattern: str
    #   sql_query: str
    #   code_block: str
    artifact: str

    def run(self, *inputs) -> str:
        """Exercise the organism on a test input. Return whatever your
        evaluator wants to score."""
        # TODO: implement. For prompt evolution this typically calls _prompt_llm
        # with the artifact rendered against the input. For regex/SQL it would
        # call `re.findall(self.artifact, input)` / execute SQL / etc.
        raise NotImplementedError


# ---------------------------------------------------------------------------
# 2. EVALUATOR: score organisms and surface failures the mutator can learn from.
# ---------------------------------------------------------------------------
class MyFailureCase(EvaluationFailureCase):
    # TODO: include enough context for the LLM to diagnose the failure.
    input: str
    expected: str
    actual: str


class MyEvaluator(Evaluator[MyOrganism, EvaluationResult, MyFailureCase]):
    # Split your dataset. Mutator only sees trainable; holdout detects overfitting.
    TRAINABLE = [
        # TODO: list of (input, expected) tuples
        # ("input1", "expected1"),
    ]
    HOLDOUT = [
        # TODO: separate set the mutator never sees
    ]

    def evaluate(self, organism: MyOrganism) -> EvaluationResult:
        train_fails: list[MyFailureCase] = []
        hold_fails: list[MyFailureCase] = []
        for i, (inp, expected) in enumerate(self.TRAINABLE):
            actual = organism.run(inp)
            if actual != expected:
                train_fails.append(MyFailureCase(
                    input=inp, expected=expected, actual=actual,
                    data_point_id=f"trainable_{i}",
                ))
        for i, (inp, expected) in enumerate(self.HOLDOUT):
            actual = organism.run(inp)
            if actual != expected:
                hold_fails.append(MyFailureCase(
                    input=inp, expected=expected, actual=actual,
                    data_point_id=f"holdout_{i}",
                ))
        n_total = len(self.TRAINABLE) + len(self.HOLDOUT)
        n_ok = n_total - len(train_fails) - len(hold_fails)
        return EvaluationResult(
            score=n_ok / n_total if n_total else 0.0,
            trainable_failure_cases=train_fails,
            holdout_failure_cases=hold_fails,
            # Always-viable. The evolver only blocks completely-broken organisms;
            # a 0-score organism is fine and will simply be sampled less often.
            is_viable=True,
        )


# ---------------------------------------------------------------------------
# 3. MUTATOR: LLM proposes an improved organism from a failure case.
# ---------------------------------------------------------------------------
class MyMutator(Mutator[MyOrganism, MyFailureCase]):
    PROMPT = """
The current artifact is:
```
{artifact}
```

On this input:
```
{input}
```
it produced:
```
{actual}
```
but we wanted:
```
{expected}
```

Diagnose what went wrong, then propose an improved version of the artifact.
Put the new version in the LAST triple-backtick block of your response.
""".strip()

    def mutate(
        self,
        organism: MyOrganism,
        failure_cases: list[MyFailureCase],
        learning_log_entries: list[LearningLogEntry],
    ) -> list[MyOrganism]:
        if not failure_cases:
            # Nothing to learn from, so propose no mutation.
            return []
        fc = failure_cases[0]
        prompt = self.PROMPT.format(
            artifact=organism.artifact,
            input=fc.input,
            actual=fc.actual,
            expected=fc.expected,
        )
        resp = _prompt_llm(prompt)
        parts = resp.split("```")
        if len(parts) < 3:
            return []
        # Text between the opening fence and the first newline is a language
        # tag such as "python" or "sql" (empty for a bare fence); drop it.
        # Checking the raw block avoids discarding a short first line of the
        # artifact itself, e.g. "SELECT id".
        _tag, sep, rest = parts[-2].partition("\n")
        new_artifact = (rest if sep else parts[-2]).strip()
        return [MyOrganism(artifact=new_artifact)]


# ---------------------------------------------------------------------------
# Driver: fills in the EvolveProblemLoop boilerplate. You shouldn't need to
# touch anything below this line for a typical run.
# ---------------------------------------------------------------------------
def make_problem() -> Problem:
    initial = MyOrganism(artifact="TODO: starting artifact here")  # TODO
    return Problem[MyOrganism, EvaluationResult, MyFailureCase](
        evaluator=MyEvaluator(),
        mutators=[MyMutator()],
        initial_organism=initial,
    )


def main() -> int:
    ap = argparse.ArgumentParser()
    register_hyperparameter_args(ap.add_argument_group("hyperparameters"))
    ap.add_argument("--num_iterations", type=int, default=3)
    ap.add_argument("--mutator_concurrency", type=int, default=2)
    ap.add_argument("--evaluator_concurrency", type=int, default=2)
    ap.add_argument("--output_dir", type=str, required=True)
    args = ap.parse_args()

    out = Path(args.output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "snapshots").mkdir(exist_ok=True)

    hp = build_hyperparameter_config_from_args(args)
    loop = EvolveProblemLoop(
        problem=make_problem(),
        learning_log_view_type=parse_learning_log_view_type(hp.learning_log_view_type),
        num_parents_per_iteration=hp.num_parents_per_iteration,
        mutator_concurrency=args.mutator_concurrency,
        evaluator_concurrency=args.evaluator_concurrency,
        fixed_midpoint_score=hp.fixed_midpoint_score,
        midpoint_score_percentile=hp.midpoint_score_percentile,
        sharpness=hp.sharpness,
        novelty_weight=hp.novelty_weight,
        batch_size=hp.batch_size,
        should_verify_mutations=hp.verify_mutations,
    )

    print("Evaluating initial organism...")
    # loop.run() is a generator: each iteration runs only as it is consumed.
    for snap in loop.run(num_iterations=args.num_iterations):
        (out / "snapshots" / f"iteration_{snap.iteration}.pkl").write_bytes(snap.snapshot)
        _, best = snap.best_organism_result
        print(f"iter={snap.iteration} pop={snap.population_size} best_score={best.score:.3f}")

    print(f"\nDone. Results in: {out}")
    return 0


if __name__ == "__main__":
    sys.exit(main())