diff --git a/optional-skills/research/darwinian-evolver/SKILL.md b/optional-skills/research/darwinian-evolver/SKILL.md
new file mode 100644
index 00000000000..272f6702481
--- /dev/null
+++ b/optional-skills/research/darwinian-evolver/SKILL.md
@@ -0,0 +1,199 @@
+---
+name: darwinian-evolver
+description: Evolve prompts/regex/SQL/code with Imbue's evolution loop.
+version: 0.1.0
+author: Bihruze (Asahi0x), Hermes Agent
+license: MIT
+platforms: [linux, macos]
+metadata:
+  hermes:
+    tags: [evolution, optimization, prompt-engineering, research]
+    related_skills: [arxiv, jupyter-live-kernel]
+---
+
+# Darwinian Evolver
+
+Run Imbue's [darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver) — an
+LLM-driven evolutionary search loop — to optimize a **prompt, regex, SQL query,
+or small code snippet** against a fitness function.
+
+Status: thin wrapper around the upstream tool. The skill installs it, walks the
+agent through writing a `Problem` definition (organism + evaluator + mutator),
+and drives the loop via the upstream CLI or a small custom Python driver.
+
+**License:** the upstream tool is **AGPL-3.0**. The skill ONLY ever invokes it
+via the upstream CLI or a `subprocess`/`uv run` call (mere aggregation). Do NOT
+import upstream classes into Hermes itself.
+
+## When to Use
+
+- User says "optimize this prompt", "evolve a regex for X", "auto-improve this
+  code/SQL", "search for a better instruction".
+- You have a scorer (exact match, regex pass-rate, unit test, LLM-judge, runtime
+  metric) AND a starting candidate (organism). If you don't have a scorer, stop
+  and define one first — that's the hard part.
+- Cost is OK: a typical run is 50–500 LLM calls. On gpt-4o-mini that's pennies;
+  on Claude Sonnet it can be a few dollars.
+
+Do **not** use this when:
+- The optimization target is differentiable (use gradient descent / DSPy).
+- You only need to try 2–3 variants — just write them by hand.
+- The fitness signal is purely subjective with no measurable criterion.
+
+## Prerequisites
+
+- Python ≥3.11
+- `git`, `uv` (or `pip`)
+- One of: `OPENROUTER_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY`
+
+The skill ships a small `parrot_openrouter.py` driver that uses `OPENROUTER_API_KEY`
+via the OpenAI SDK, so any model on OpenRouter works. The upstream CLI itself
+hardcodes Anthropic and needs `ANTHROPIC_API_KEY`.
+
+## Install (One-Time)
+
+Run via the `terminal` tool:
+
+```bash
+mkdir -p ~/.hermes/cache/darwinian-evolver && cd ~/.hermes/cache/darwinian-evolver
+[ -d darwinian_evolver ] || git clone --depth 1 https://github.com/imbue-ai/darwinian_evolver.git
+cd darwinian_evolver && uv sync
+```
+
+Verify:
+
+```bash
+cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver \
+  && uv run darwinian_evolver --help | head -5
+```
+
+## Quick Start — The Built-In Parrot Example
+
+Tiny smoke test (requires `ANTHROPIC_API_KEY`):
+
+```bash
+cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver
+uv run darwinian_evolver parrot \
+  --num_iterations 2 \
+  --num_parents_per_iteration 2 \
+  --mutator_concurrency 2 --evaluator_concurrency 2 \
+  --output_dir /tmp/parrot_demo
+```
+
+Outputs:
+- `/tmp/parrot_demo/snapshots/iteration_N.pkl` — pickled population per iteration
+- `/tmp/parrot_demo/` — per-iteration JSON log (path printed at end)
+
+Open `~/.hermes/cache/darwinian-evolver/darwinian_evolver/darwinian_evolver/lineage_visualizer.html`
+in a browser and load the JSON log to see the evolutionary tree.
+
+## Quick Start — OpenRouter Driver (No Anthropic Key)
+
+The skill ships `scripts/parrot_openrouter.py` — same parrot problem, but the
+LLM call goes through OpenRouter so any provider works.
+
+```bash
+# From wherever the skill is installed:
+SKILL_DIR=~/.hermes/skills/research/darwinian-evolver
+DE_DIR=~/.hermes/cache/darwinian-evolver/darwinian_evolver
+
+cd "$DE_DIR" && \
+  EVOLVER_MODEL='openai/gpt-4o-mini' \
+  uv run --with openai python "$SKILL_DIR/scripts/parrot_openrouter.py" \
+    --num_iterations 3 --num_parents_per_iteration 2 \
+    --output_dir /tmp/parrot_or
+```
+
+Inspect the result with `scripts/show_snapshot.py`:
+
+```bash
+uv run --with openai python "$SKILL_DIR/scripts/show_snapshot.py" \
+  /tmp/parrot_or/snapshots/iteration_3.pkl
+```
+
+Expected output: 7 evolved prompt templates ranked by score, with the best
+landing around 0.6–0.8 (the seed `Say {{ phrase }}` scored 0.000).
+
+## Defining a Custom Problem
+
+The skill ships `templates/custom_problem_template.py` — copy, edit, run.
+Three things you must define:
+
+1. **`Organism`** — a Pydantic `BaseModel` subclass holding the artifact being
+   evolved (`prompt_template: str`, `regex_pattern: str`, `sql_query: str`,
+   `code_block: str`, etc.). Add a `run(*args)` method that exercises it.
+
+2. **`Evaluator`** — `.evaluate(organism) -> EvaluationResult(score=..., trainable_failure_cases=[...], holdout_failure_cases=[...], is_viable=True)`.
+   - **`score`** is in `[0, 1]`. Higher is better.
+   - **`trainable_failure_cases`** — what the mutator sees. Include enough
+     context (input, expected, actual) for the LLM to diagnose.
+   - **`holdout_failure_cases`** — kept out of the mutator's view. Use these
+     to detect overfitting.
+   - **`is_viable=True`** unless the organism is completely broken (raises,
+     returns None, etc.). A 0-score viable organism is fine — it just gets
+     down-weighted in parent selection.
+
+3. **`Mutator`** — `.mutate(organism, failure_cases, learning_log_entries) -> list[Organism]`.
+   Typically: build an LLM prompt that includes the current organism + a
+   failure case + an ask to propose a fix; parse the LLM's response; return
+   a new `Organism`. Return `[]` on parse failure — the loop handles it.
+
+Then write a driver script that wires `Problem(initial_organism, evaluator, [mutators])`
+into `EvolveProblemLoop` and iterates over `loop.run(num_iterations=N)` — the
+shipped `scripts/parrot_openrouter.py` is the reference.
+
+## Hyperparameters That Actually Matter
+
+| flag | default | when to change |
+|---|---|---|
+| `--num_iterations` | 5 | bump to 10–20 once you trust the evaluator |
+| `--num_parents_per_iteration` | 4 | drop to 2 for cheap exploration |
+| `--mutator_concurrency` | 10 | drop to 2–4 to avoid rate limits |
+| `--evaluator_concurrency` | 10 | same; evaluator hits the LLM too |
+| `--batch_size` | 1 | raise to 3–5 once your mutator handles multiple failures |
+| `--verify_mutations` | off | turn on once mutator is wasteful (>10× cost saving on later runs per Imbue) |
+| `--midpoint_score` | `p75` | leave alone unless scores cluster |
+| `--sharpness` | 10 | leave alone |
+
+## Pitfalls
+
+1. **`Initial organism must be viable`** — set `is_viable=True` in your
+   `EvaluationResult` even on a 0-score seed. The loop refuses non-viable
+   organisms because they imply the loop has nothing to evolve from.
+2. **Provider content filters kill runs.** Azure-backed OpenRouter models
+   reject phrases like "ignore previous instructions" with HTTP 400. Wrap
+   the LLM call in `try/except` and return an `LLM_ERROR`-tagged string; the
+   evolver will just score that organism 0 and move on.
+3. **`loop.run()` is a generator** — calling it doesn't run anything until
+   you iterate. Use `for snap in loop.run(num_iterations=N):`.
+4. **Snapshots are nested pickles.** `iteration_N.pkl` contains a dict with
+   `population_snapshot` (more pickled bytes). To unpickle you must have the
+   `Organism` class importable under the same dotted path it was pickled at.
+5. **Concurrency defaults are aggressive.** 10/10 will hit rate limits on
+   most providers. Start with 2/2.
+6. **CLI is hardcoded to Anthropic.** `uv run darwinian_evolver`
+   reaches for `ANTHROPIC_API_KEY` and uses Claude Sonnet. To use any other
+   provider, write a driver like `parrot_openrouter.py`.
+7. **AGPL.** Never `from darwinian_evolver import ...` inside Hermes core.
+   Custom driver scripts under `~/.hermes/skills/...` are user-side and fine.
+8. **No PyPI package.** `pip install darwinian-evolver` will pull the wrong
+   thing. Always install from the GitHub repo.
+
+## Verification
+
+After install + a parrot run, exit code 0 from this is sufficient:
+
+```bash
+DE_DIR=~/.hermes/cache/darwinian-evolver/darwinian_evolver
+ls "$DE_DIR/darwinian_evolver/lineage_visualizer.html" >/dev/null && \
+cd "$DE_DIR" && uv run darwinian_evolver --help >/dev/null && \
+echo "darwinian-evolver: OK"
+```
+
+## References
+
+- [Imbue research post](https://imbue.com/research/2026-02-27-darwinian-evolver/)
+- [ARC-AGI-2 results](https://imbue.com/research/2026-02-27-arc-agi-2-evolution/)
+- [imbue-ai/darwinian_evolver](https://github.com/imbue-ai/darwinian_evolver) (AGPL-3.0)
+- [Darwin Gödel Machines](https://arxiv.org/abs/2505.22954)
+- [PromptBreeder](https://arxiv.org/abs/2309.16797)
diff --git a/optional-skills/research/darwinian-evolver/scripts/parrot_openrouter.py b/optional-skills/research/darwinian-evolver/scripts/parrot_openrouter.py
new file mode 100644
index 00000000000..545f8f1feb3
--- /dev/null
+++ b/optional-skills/research/darwinian-evolver/scripts/parrot_openrouter.py
@@ -0,0 +1,218 @@
+"""
+parrot_openrouter: same as the upstream `parrot` example but the LLM call goes
+through OpenRouter (OpenAI SDK) instead of Anthropic native. Lets us run an
+end-to-end evolution with whatever model the user already has paid access to.
+
+Run with:
+    uv --project darwinian_evolver run python parrot_openrouter.py \
+        --num_iterations 3 --output_dir /tmp/parrot_out
+
+Reads `OPENROUTER_API_KEY` from the environment.
+"""
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+import jinja2
+from openai import OpenAI
+
+# Upstream problem types, imported not vendored (AGPL — only run via subprocess in production)
+from darwinian_evolver.cli_common import build_hyperparameter_config_from_args
+from darwinian_evolver.cli_common import register_hyperparameter_args
+from darwinian_evolver.cli_common import parse_learning_log_view_type
+from darwinian_evolver.evolve_problem_loop import EvolveProblemLoop
+from darwinian_evolver.learning_log import LearningLogEntry
+from darwinian_evolver.problem import EvaluationFailureCase
+from darwinian_evolver.problem import EvaluationResult
+from darwinian_evolver.problem import Evaluator
+from darwinian_evolver.problem import Mutator
+from darwinian_evolver.problem import Organism
+from darwinian_evolver.problem import Problem
+
+DEFAULT_MODEL = os.environ.get("EVOLVER_MODEL", "openai/gpt-4o-mini")
+
+
+def _client() -> OpenAI:
+    key = os.environ.get("OPENROUTER_API_KEY")
+    if not key:
+        sys.exit("OPENROUTER_API_KEY is not set")
+    return OpenAI(api_key=key, base_url="https://openrouter.ai/api/v1")
+
+
+def _prompt_llm(prompt: str) -> str:
+    try:
+        r = _client().chat.completions.create(
+            model=DEFAULT_MODEL,
+            max_tokens=1024,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        return r.choices[0].message.content or ""
+    except Exception as e:
+        # Treat any provider error (rate limit, content filter, schema reject)
+        # as a failed response. The evolver will simply see this as a low score
+        # on this organism and move on — much friendlier than killing the run.
+        return f"<LLM_ERROR: {e}>"
+
+
+class ParrotOrganism(Organism):
+    prompt_template: str
+
+    def run(self, phrase: str) -> str:
+        try:
+            prompt = jinja2.Template(self.prompt_template).render(phrase=phrase)
+        except jinja2.exceptions.TemplateError as e:
+            return f"Error rendering prompt: {e}"
+        if not prompt:
+            return ""
+        return _prompt_llm(prompt)
+
+
+class ParrotEvaluationFailureCase(EvaluationFailureCase):
+    phrase: str
+    response: str
+
+
+class ImproveParrotMutator(Mutator[ParrotOrganism, ParrotEvaluationFailureCase]):
+    IMPROVEMENT_PROMPT_TEMPLATE = """
+We want to build a prompt that causes an LLM to repeat back a given phrase verbatim.
+
+The current prompt template is:
+```
+{{ organism.prompt_template }}
+```
+
+Unfortunately, on this phrase:
+```
+{{ failure_case.phrase }}
+```
+the LLM responded with:
+```
+{{ failure_case.response }}
+```
+
+Diagnose what went wrong, then propose an improved prompt template. Put the new
+template in the LAST triple-backtick block of your response.
+""".strip()
+
+    def mutate(
+        self,
+        organism: ParrotOrganism,
+        failure_cases: list[ParrotEvaluationFailureCase],
+        learning_log_entries: list[LearningLogEntry],
+    ) -> list[ParrotOrganism]:
+        fc = failure_cases[0]
+        prompt = jinja2.Template(self.IMPROVEMENT_PROMPT_TEMPLATE).render(
+            organism=organism, failure_case=fc
+        )
+        try:
+            resp = _prompt_llm(prompt)
+            parts = resp.split("```")
+            if len(parts) < 3:
+                return []
+            new_tpl = parts[-2].strip()
+            return [ParrotOrganism(prompt_template=new_tpl)]
+        except Exception as e:
+            print(f"mutate error: {e}", file=sys.stderr)
+            return []
+
+
+class ParrotEvaluator(Evaluator[ParrotOrganism, EvaluationResult, ParrotEvaluationFailureCase]):
+    TRAINABLE_PHRASES = [
+        "Hello world.",
+        "bla",
+        "Bla",
+        "bla.",
+        '"bla bla".',
+        "Just say 'foo' once with no extra words.",
+    ]
+    HOLDOUT_PHRASES = [
+        "bla, but only once.",
+        "'bla'",
+    ]
+
+    def evaluate(self, organism: ParrotOrganism) -> EvaluationResult:
+        train_fails: list[ParrotEvaluationFailureCase] = []
+        hold_fails: list[ParrotEvaluationFailureCase] = []
+        for i, p in enumerate(self.TRAINABLE_PHRASES):
+            r = organism.run(p)
+            if r != p:
+                train_fails.append(ParrotEvaluationFailureCase(
+                    phrase=p, response=r, data_point_id=f"trainable_{i}"))
+        for i, p in enumerate(self.HOLDOUT_PHRASES):
+            r = organism.run(p)
+            if r != p:
+                hold_fails.append(ParrotEvaluationFailureCase(
+                    phrase=p, response=r, data_point_id=f"holdout_{i}"))
+        n_total = len(self.TRAINABLE_PHRASES) + len(self.HOLDOUT_PHRASES)
+        n_ok = n_total - len(train_fails) - len(hold_fails)
+        return EvaluationResult(
+            score=n_ok / n_total,
+            trainable_failure_cases=train_fails,
+            holdout_failure_cases=hold_fails,
+            # Always viable. Even a 0-score seed is a valid starting point; the
+            # mutator should still get a chance to fix it.
+            is_viable=True,
+        )
+
+
+def make_problem() -> Problem:
+    return Problem[ParrotOrganism, EvaluationResult, ParrotEvaluationFailureCase](
+        evaluator=ParrotEvaluator(),
+        mutators=[ImproveParrotMutator()],
+        initial_organism=ParrotOrganism(prompt_template="Say {{ phrase }}"),
+    )
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    register_hyperparameter_args(ap.add_argument_group("hyperparameters"))
+    ap.add_argument("--num_iterations", type=int, default=3)
+    ap.add_argument("--mutator_concurrency", type=int, default=4)
+    ap.add_argument("--evaluator_concurrency", type=int, default=4)
+    ap.add_argument("--output_dir", type=str, required=True)
+    args = ap.parse_args()
+
+    out = Path(args.output_dir)
+    out.mkdir(parents=True, exist_ok=True)
+
+    hp = build_hyperparameter_config_from_args(args)
+    loop = EvolveProblemLoop(
+        problem=make_problem(),
+        learning_log_view_type=parse_learning_log_view_type(hp.learning_log_view_type),
+        num_parents_per_iteration=hp.num_parents_per_iteration,
+        mutator_concurrency=args.mutator_concurrency,
+        evaluator_concurrency=args.evaluator_concurrency,
+        fixed_midpoint_score=hp.fixed_midpoint_score,
+        midpoint_score_percentile=hp.midpoint_score_percentile,
+        sharpness=hp.sharpness,
+        novelty_weight=hp.novelty_weight,
+        batch_size=hp.batch_size,
+        should_verify_mutations=hp.verify_mutations,
+    )
+
+    import json
+    log_path = out / "results.jsonl"
+    snap_dir = out / "snapshots"
+    snap_dir.mkdir(exist_ok=True)
+    print("Evaluating initial organism...")
+    for snap in loop.run(num_iterations=args.num_iterations):
+        (snap_dir / f"iteration_{snap.iteration}.pkl").write_bytes(snap.snapshot)
+        _, best_eval = snap.best_organism_result
+        print(f"iter={snap.iteration} pop={snap.population_size} "
+              f"best_score={best_eval.score:.3f}")
+        with log_path.open("a") as f:
+            f.write(json.dumps({
+                "iteration": snap.iteration,
+                "best_score": best_eval.score,
+                "pop_size": snap.population_size,
+                "score_percentiles": {str(k): v for k, v in snap.score_percentiles.items()},
+            }) + "\n")
+    print(f"\nDone. Results in: {out}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/optional-skills/research/darwinian-evolver/scripts/show_snapshot.py b/optional-skills/research/darwinian-evolver/scripts/show_snapshot.py
new file mode 100644
index 00000000000..10e3a03dca9
--- /dev/null
+++ b/optional-skills/research/darwinian-evolver/scripts/show_snapshot.py
@@ -0,0 +1,69 @@
+"""
+show_snapshot.py — Dump the population from a darwinian-evolver snapshot pickle.
+
+Usage:
+    python show_snapshot.py PATH/TO/iteration_N.pkl [--field prompt_template]
+
+The script is intentionally Organism-agnostic: it walks `vars(org)` and prints
+one str field per organism. By default that is the first str field it finds; pass
+--field to target a specific attribute (e.g. `regex_pattern`, `sql_query`, `code_block`).
+"""
+from __future__ import annotations
+
+import argparse
+import pickle
+import sys
+from pathlib import Path
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    ap.add_argument("snapshot", type=Path)
+    ap.add_argument(
+        "--field",
+        default=None,
+        help="Organism attribute to display. Defaults to the first str field found.",
+    )
+    ap.add_argument("--top", type=int, default=None, help="Show only top N by score.")
+    args = ap.parse_args()
+
+    if not args.snapshot.exists():
+        sys.exit(f"snapshot not found: {args.snapshot}")
+
+    # The outer pickle wraps a dict; the inner pickle contains the actual organism
+    # objects, which must be importable under their original dotted path. If you
+    # ran a custom driver, make sure its module is on sys.path before calling this.
+    outer = pickle.loads(args.snapshot.read_bytes())
+    if not isinstance(outer, dict) or "population_snapshot" not in outer:
+        sys.exit("not a darwinian-evolver snapshot (no population_snapshot key)")
+    inner = pickle.loads(outer["population_snapshot"])
+    pairs = inner["organisms"]  # list of (Organism, EvaluationResult)
+
+    print(f"# organisms: {len(pairs)}\n")
+    ranked = sorted(pairs, key=lambda p: getattr(p[1], "score", 0) or 0, reverse=True)
+    if args.top:
+        ranked = ranked[: args.top]
+
+    for i, (org, res) in enumerate(ranked):
+        score = getattr(res, "score", float("nan"))
+        print(f"=== rank {i} score={score:.3f} ===")
+        # pick field
+        field = args.field
+        if field is None:
+            for k, v in vars(org).items():
+                if isinstance(v, str) and not k.startswith("_") and k not in ("id",):
+                    field = k
+                    break
+        val = getattr(org, field, None) if field else None
+        if val is None:
+            print(f" (no string field; org fields: {list(vars(org).keys())})")
+        else:
+            print(f" {field} ({len(val)} chars):")
+            for ln in val.splitlines()[:30]:
+                print(f" {ln}")
+        print()
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/optional-skills/research/darwinian-evolver/templates/custom_problem_template.py b/optional-skills/research/darwinian-evolver/templates/custom_problem_template.py
new file mode 100644
index 00000000000..c6daac14ede
--- /dev/null
+++ b/optional-skills/research/darwinian-evolver/templates/custom_problem_template.py
@@ -0,0 +1,240 @@
+"""
+Template: a custom darwinian-evolver problem.
+
+Copy this file, fill in the THREE marked spots (Organism, Evaluator, Mutator),
+then run it as a driver script. The skeleton handles all the wiring so you only
+write the domain-specific logic.
+
+To run:
+    cd ~/.hermes/cache/darwinian-evolver/darwinian_evolver
+    OPENROUTER_API_KEY=... uv run --with openai python /path/to/this_file.py \
+        --num_iterations 3 --num_parents_per_iteration 2 \
+        --output_dir /tmp/my_problem
+
+The pattern mirrors `scripts/parrot_openrouter.py` (the working reference).
+"""
+from __future__ import annotations
+
+import argparse
+import os
+import sys
+from pathlib import Path
+
+from openai import OpenAI
+
+# Upstream types (AGPL — invoked via subprocess in production; importing here
+# is fine for skill-side driver scripts the user owns).
+from darwinian_evolver.cli_common import (
+    build_hyperparameter_config_from_args,
+    parse_learning_log_view_type,
+    register_hyperparameter_args,
+)
+from darwinian_evolver.evolve_problem_loop import EvolveProblemLoop
+from darwinian_evolver.learning_log import LearningLogEntry
+from darwinian_evolver.problem import (
+    EvaluationFailureCase,
+    EvaluationResult,
+    Evaluator,
+    Mutator,
+    Organism,
+    Problem,
+)
+
+DEFAULT_MODEL = os.environ.get("EVOLVER_MODEL", "openai/gpt-4o-mini")
+
+
+def _client() -> OpenAI:
+    key = os.environ.get("OPENROUTER_API_KEY")
+    if not key:
+        sys.exit("OPENROUTER_API_KEY is not set")
+    return OpenAI(api_key=key, base_url="https://openrouter.ai/api/v1")
+
+
+def _prompt_llm(prompt: str, max_tokens: int = 1024) -> str:
+    try:
+        r = _client().chat.completions.create(
+            model=DEFAULT_MODEL,
+            max_tokens=max_tokens,
+            messages=[{"role": "user", "content": prompt}],
+        )
+        return r.choices[0].message.content or ""
+    except Exception as e:
+        # Never let one bad LLM response kill the run.
+        return f"<LLM_ERROR: {e}>"
+
+
+# ---------------------------------------------------------------------------
+# 1. ORGANISM — what you are evolving.
+# ---------------------------------------------------------------------------
+class MyOrganism(Organism):
+    # TODO: replace with your artifact field. Common shapes:
+    #   prompt_template: str
+    #   regex_pattern: str
+    #   sql_query: str
+    #   code_block: str
+    artifact: str
+
+    def run(self, *inputs) -> str:
+        """Exercise the organism on a test input. Return whatever your
+        evaluator wants to score."""
+        # TODO: implement. For prompt evolution this typically calls _prompt_llm
+        # with the artifact rendered against the input. For regex/SQL it would
+        # call `re.findall(self.artifact, input)` / execute SQL / etc.
+        raise NotImplementedError
+
+
+# ---------------------------------------------------------------------------
+# 2. EVALUATOR — score organisms and surface failures the mutator can learn from.
+# ---------------------------------------------------------------------------
+class MyFailureCase(EvaluationFailureCase):
+    # TODO: include enough context for the LLM to diagnose the failure.
+    input: str
+    expected: str
+    actual: str
+
+
+class MyEvaluator(Evaluator[MyOrganism, EvaluationResult, MyFailureCase]):
+    # Split your dataset. Mutator only sees trainable; holdout detects overfitting.
+    TRAINABLE = [
+        # TODO: list of (input, expected) tuples
+        # ("input1", "expected1"),
+    ]
+    HOLDOUT = [
+        # TODO: separate set the mutator never sees
+    ]
+
+    def evaluate(self, organism: MyOrganism) -> EvaluationResult:
+        train_fails: list[MyFailureCase] = []
+        hold_fails: list[MyFailureCase] = []
+        for i, (inp, expected) in enumerate(self.TRAINABLE):
+            actual = organism.run(inp)
+            if actual != expected:
+                train_fails.append(MyFailureCase(
+                    input=inp, expected=expected, actual=actual,
+                    data_point_id=f"trainable_{i}",
+                ))
+        for i, (inp, expected) in enumerate(self.HOLDOUT):
+            actual = organism.run(inp)
+            if actual != expected:
+                hold_fails.append(MyFailureCase(
+                    input=inp, expected=expected, actual=actual,
+                    data_point_id=f"holdout_{i}",
+                ))
+        n_total = len(self.TRAINABLE) + len(self.HOLDOUT)
+        n_ok = n_total - len(train_fails) - len(hold_fails)
+        return EvaluationResult(
+            score=n_ok / n_total if n_total else 0.0,
+            trainable_failure_cases=train_fails,
+            holdout_failure_cases=hold_fails,
+            # Always-viable. The evolver only blocks completely-broken organisms;
+            # a 0-score organism is fine and will simply be sampled less often.
+            is_viable=True,
+        )
+
+
+# ---------------------------------------------------------------------------
+# 3. MUTATOR — LLM proposes an improved organism from a failure case.
+# ---------------------------------------------------------------------------
+class MyMutator(Mutator[MyOrganism, MyFailureCase]):
+    PROMPT = """
+The current artifact is:
+```
+{artifact}
+```
+
+On this input:
+```
+{input}
+```
+it produced:
+```
+{actual}
+```
+but we wanted:
+```
+{expected}
+```
+
+Diagnose what went wrong, then propose an improved version of the artifact.
+Put the new version in the LAST triple-backtick block of your response.
+""".strip()
+
+    def mutate(
+        self,
+        organism: MyOrganism,
+        failure_cases: list[MyFailureCase],
+        learning_log_entries: list[LearningLogEntry],
+    ) -> list[MyOrganism]:
+        fc = failure_cases[0]
+        prompt = self.PROMPT.format(
+            artifact=organism.artifact,
+            input=fc.input,
+            actual=fc.actual,
+            expected=fc.expected,
+        )
+        resp = _prompt_llm(prompt)
+        parts = resp.split("```")
+        if len(parts) < 3:
+            return []
+        new_artifact = parts[-2].strip()
+        # Strip an opening language tag: a lone short word such as "python" or "sql".
+        if "\n" in new_artifact:
+            first_line, rest = new_artifact.split("\n", 1)
+            if first_line.isalnum() and len(first_line) < 20:
+                new_artifact = rest
+        return [MyOrganism(artifact=new_artifact)]
+
+
+# ---------------------------------------------------------------------------
+# Driver — fills in the EvolveProblemLoop boilerplate. You shouldn't need to
+# touch anything below this line for a typical run.
+# ---------------------------------------------------------------------------
+def make_problem() -> Problem:
+    initial = MyOrganism(artifact="TODO: starting artifact here")  # TODO
+    return Problem[MyOrganism, EvaluationResult, MyFailureCase](
+        evaluator=MyEvaluator(),
+        mutators=[MyMutator()],
+        initial_organism=initial,
+    )
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser()
+    register_hyperparameter_args(ap.add_argument_group("hyperparameters"))
+    ap.add_argument("--num_iterations", type=int, default=3)
+    ap.add_argument("--mutator_concurrency", type=int, default=2)
+    ap.add_argument("--evaluator_concurrency", type=int, default=2)
+    ap.add_argument("--output_dir", type=str, required=True)
+    args = ap.parse_args()
+
+    out = Path(args.output_dir)
+    out.mkdir(parents=True, exist_ok=True)
+    (out / "snapshots").mkdir(exist_ok=True)
+
+    hp = build_hyperparameter_config_from_args(args)
+    loop = EvolveProblemLoop(
+        problem=make_problem(),
+        learning_log_view_type=parse_learning_log_view_type(hp.learning_log_view_type),
+        num_parents_per_iteration=hp.num_parents_per_iteration,
+        mutator_concurrency=args.mutator_concurrency,
+        evaluator_concurrency=args.evaluator_concurrency,
+        fixed_midpoint_score=hp.fixed_midpoint_score,
+        midpoint_score_percentile=hp.midpoint_score_percentile,
+        sharpness=hp.sharpness,
+        novelty_weight=hp.novelty_weight,
+        batch_size=hp.batch_size,
+        should_verify_mutations=hp.verify_mutations,
+    )
+
+    print("Evaluating initial organism...")
+    for snap in loop.run(num_iterations=args.num_iterations):
+        (out / "snapshots" / f"iteration_{snap.iteration}.pkl").write_bytes(snap.snapshot)
+        _, best = snap.best_organism_result
+        print(f"iter={snap.iteration} pop={snap.population_size} best_score={best.score:.3f}")
+
+    print(f"\nDone. Results in: {out}")
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
diff --git a/tests/skills/test_darwinian_evolver_skill.py b/tests/skills/test_darwinian_evolver_skill.py
new file mode 100644
index 00000000000..8b3a14b8da9
--- /dev/null
+++ b/tests/skills/test_darwinian_evolver_skill.py
@@ -0,0 +1,102 @@
+"""
+Smoke tests for the darwinian-evolver optional skill.
+
+We can't actually run the evolution loop in CI (it needs network + a paid LLM),
+so these tests verify:
+  - SKILL.md frontmatter conforms to the hardline format
+  - shipped scripts parse as valid Python
+  - the scripts reference the right env var / module paths
+"""
+from __future__ import annotations
+
+import ast
+import re
+from pathlib import Path
+
+import pytest
+import yaml
+
+SKILL_DIR = Path(__file__).resolve().parents[2] / "optional-skills" / "research" / "darwinian-evolver"
+
+
+@pytest.fixture(scope="module")
+def frontmatter() -> dict:
+    src = (SKILL_DIR / "SKILL.md").read_text()
+    m = re.search(r"^---\n(.*?)\n---", src, re.DOTALL)
+    assert m, "SKILL.md missing YAML frontmatter"
+    return yaml.safe_load(m.group(1))
+
+
+def test_skill_dir_exists() -> None:
+    assert SKILL_DIR.is_dir(), f"missing skill dir: {SKILL_DIR}"
+
+
+def test_skill_md_present() -> None:
+    assert (SKILL_DIR / "SKILL.md").is_file()
+
+
+def test_description_under_60_chars(frontmatter) -> None:
+    desc = frontmatter["description"]
+    assert len(desc) <= 60, f"description is {len(desc)} chars (hardline ≤60): {desc!r}"
+
+
+def test_name_matches_dir(frontmatter) -> None:
+    assert frontmatter["name"] == "darwinian-evolver"
+
+
+def test_platforms_excludes_windows(frontmatter) -> None:
+    # Upstream uses func_timeout (POSIX signals) and uv subprocess pipelines; the
+    # skill is gated [linux, macos]. If we ever port to Windows, update this test
+    # to assert ["linux", "macos", "windows"].
+    assert "windows" not in frontmatter["platforms"]
+    assert set(frontmatter["platforms"]) >= {"linux", "macos"}
+
+
+def test_author_credits_contributor(frontmatter) -> None:
+    author = frontmatter["author"]
+    assert "Bihruze" in author, f"author should credit the original contributor: {author!r}"
+
+
+def test_license_mit(frontmatter) -> None:
+    assert frontmatter["license"] == "MIT"
+
+
+@pytest.mark.parametrize(
+    "path",
+    [
+        "scripts/parrot_openrouter.py",
+        "scripts/show_snapshot.py",
+        "templates/custom_problem_template.py",
+    ],
+)
+def test_shipped_scripts_parse(path: str) -> None:
+    src = (SKILL_DIR / path).read_text()
+    ast.parse(src)  # raises SyntaxError on broken Python
+
+
+def test_parrot_script_uses_openrouter() -> None:
+    src = (SKILL_DIR / "scripts" / "parrot_openrouter.py").read_text()
+    assert "OPENROUTER_API_KEY" in src, "parrot driver should read OPENROUTER_API_KEY"
+    assert "openrouter.ai/api/v1" in src, "parrot driver should target OpenRouter"
+    assert "EVOLVER_MODEL" in src, "model should be overridable via EVOLVER_MODEL"
+
+
+def test_parrot_script_has_error_swallowing() -> None:
+    """Provider content-filter / rate-limit must not kill the run — see Pitfall 2."""
+    src = (SKILL_DIR / "scripts" / "parrot_openrouter.py").read_text()
+    assert "LLM_ERROR" in src, "_prompt_llm should swallow provider errors and tag them"
+
+
+def test_skill_calls_out_agpl(frontmatter) -> None:
+    """The upstream tool is AGPL-3.0. The skill MUST flag this so users don't
+    import it into MIT-licensed code by accident."""
+    src = (SKILL_DIR / "SKILL.md").read_text()
+    assert "AGPL" in src, "SKILL.md must mention upstream AGPL license"
+
+
+def test_skill_pitfalls_section_present() -> None:
+    src = (SKILL_DIR / "SKILL.md").read_text()
+    assert "## Pitfalls" in src
+    # Pitfalls we discovered during the spike — keep them in sync with reality.
+    assert "Initial organism must be viable" in src
+    assert "generator" in src  # loop.run() pitfall
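
Note on the mutators in this diff: both take the new artifact from the LLM reply by splitting on triple backticks and reading `parts[-2]`. The sketch below isolates that parsing step so it can be checked without an LLM call. It is an illustration only: the helper name `extract_last_fenced_block` and the `KNOWN_TAGS` set are hypothetical and are not part of this PR or of upstream. Unlike the shipped code, it also rejects a reply with an unclosed fence, where `parts[-2]` would be prose rather than code.

```python
from __future__ import annotations

FENCE = "`" * 3  # triple backtick, built this way to keep the literal out of the source

# Hypothetical allow-list of language tags that may follow an opening fence.
KNOWN_TAGS = {"python", "sql", "regex", "jinja", "text", "bash"}


def extract_last_fenced_block(resp: str) -> str | None:
    """Return the body of the last complete fenced block in resp, or None."""
    parts = resp.split(FENCE)
    # Balanced fences always give an odd part count: text, code, text, ..., text.
    if len(parts) < 3 or len(parts) % 2 == 0:
        return None
    body = parts[-2].strip()
    # Drop a leading language tag only when it is a recognised single word.
    first_line, sep, rest = body.partition("\n")
    if sep and first_line.strip().lower() in KNOWN_TAGS:
        body = rest
    return body


if __name__ == "__main__":
    reply = f"Diagnosis: too vague.\n{FENCE}python\nprint(1)\n{FENCE}\n"
    print(extract_last_fenced_block(reply))
```

A mutator could return `[]` whenever the helper returns `None`, which lines up with the "Return `[]` on parse failure" guidance in SKILL.md.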