From 762b582704ec396fe7b3307cc224ab68a99b81ea Mon Sep 17 00:00:00 2001 From: Howard Li Date: Sat, 11 Apr 2026 12:28:41 -0700 Subject: [PATCH 1/2] feat(skills): add autoresearch skill for autonomous research orchestration Add a new research skill that enables continuous, self-directed research with a two-loop architecture: - Inner loop: Rapid experiment iteration with measurable outcomes - Outer loop: Periodic synthesis, pattern discovery, and direction setting Features: - Research workspace templates (state, findings, log) - Example project (LoRA rank study) - Configuration options for loop intervals and auto-commit - Integration with existing research skills (arxiv, paper-writing) Updated research/DESCRIPTION.md to include autoresearch in the skill overview. --- skills/research/DESCRIPTION.md | 30 ++- skills/research/autoresearch/SKILL.md | 200 ++++++++++++++++++ .../autoresearch/templates/findings.md | 93 ++++++++ .../autoresearch/templates/research-log.md | 90 ++++++++ .../templates/research-state.yaml | 70 ++++++ 5 files changed, 482 insertions(+), 1 deletion(-) create mode 100644 skills/research/autoresearch/SKILL.md create mode 100644 skills/research/autoresearch/templates/findings.md create mode 100644 skills/research/autoresearch/templates/research-log.md create mode 100644 skills/research/autoresearch/templates/research-state.yaml diff --git a/skills/research/DESCRIPTION.md b/skills/research/DESCRIPTION.md index a54c16906..775b49e2e 100644 --- a/skills/research/DESCRIPTION.md +++ b/skills/research/DESCRIPTION.md @@ -1,3 +1,31 @@ --- -description: Skills for academic research, paper discovery, literature review, domain reconnaissance, market data, content monitoring, and scientific knowledge retrieval. +description: Skills for academic research, paper discovery, literature review, domain reconnaissance, market data, content monitoring, and scientific knowledge retrieval. 
Includes autoresearch for autonomous, continuous research with iterative experimentation. --- + +## Skill Overview + +| Skill | Purpose | Best For | +|-------|---------|----------| +| `autoresearch` | Autonomous research orchestration | Continuous experimentation, hypothesis testing, benchmark optimization | +| `arxiv` | Search academic papers | Literature surveys, paper discovery | +| `research-paper-writing` | Write publication-ready papers | Final paper generation | +| `blogwatcher` | Monitor research blogs | Staying current with new developments | +| `llm-wiki` | LLM knowledge base | Quick reference on models and techniques | +| `polymarket` | Prediction market data | Research on market trends and predictions | + +## Getting Started with Research + +**Quick literature search:** +``` +/arxiv "transformer attention mechanisms" +``` + +**Start autonomous research:** +``` +/autoresearch "Does LoRA rank affect convergence speed?" +``` + +**Write a paper:** +``` +/research-paper-writing +``` diff --git a/skills/research/autoresearch/SKILL.md b/skills/research/autoresearch/SKILL.md new file mode 100644 index 000000000..7ef674c45 --- /dev/null +++ b/skills/research/autoresearch/SKILL.md @@ -0,0 +1,200 @@ +--- +name: autoresearch +description: Autonomous research orchestration for AI coding agents. Run continuous, self-directed research with a two-loop architecture — rapid inner-loop experiments and periodic outer-loop synthesis. Ideal for literature surveys, hypothesis testing, benchmark optimization, and iterative discovery. No human hand-holding required.
+version: 1.0.0 +author: Hermes Agent +license: MIT +metadata: + hermes: + tags: [Research, Autonomous, Experiments, ML, AI, Literature, Hypothesis, Benchmark, Optimization] + related_skills: [arxiv, research-paper-writing, web-search, notebooklm] + config: + autoresearch.loop_interval_minutes: + description: "Interval between autonomous research loops (in minutes)" + default: 20 + autoresearch.max_iterations: + description: "Maximum number of inner-loop experiments before forced reflection" + default: 10 + autoresearch.auto_commit: + description: "Automatically git-commit research milestones" + default: true +--- + +# Autoresearch + +**Autonomous research orchestration for AI coding agents.** + +Run continuous, self-directed research with a two-loop architecture: +- **Inner Loop**: Rapid experiment iteration with clear measurable outcomes +- **Outer Loop**: Periodic synthesis, pattern discovery, and direction setting + +Ideal for literature surveys, hypothesis testing, benchmark optimization, mechanistic interpretability studies, and any research requiring iterative experimentation. + +## When to Use + +| Scenario | Use Autoresearch? | +|----------|-------------------| +| "I want to explore X and see what works" | ✅ Yes | +| "Does technique Y improve metric Z?" | ✅ Yes | +| "What's the state of the art for problem W?" | ✅ Yes (bootstrap + literature) | +| "Train a model with specific hyperparameters" | ❌ Use domain skills directly | +| "Run a single evaluation" | ❌ Use evaluation skills directly | + +## Quick Start + +```bash +# Start a research project +/autoresearch "Does LoRA rank affect convergence speed on small datasets?" 
+ +# Or with the research tool +research_init(project="lora-rank-study", question="Does LoRA rank affect convergence speed?") +``` + +## The Two-Loop Architecture + +``` +BOOTSTRAP (once) + ↓ +INNER LOOP (fast, repeating) → Run experiments → Measure → Record → Learn + ↓ (every N experiments or when stuck) +OUTER LOOP (reflective) → Synthesize → New hypotheses → Decide direction + ↓ +CONCLUDE → Write findings → Generate report +``` + +### Inner Loop: Experiment Fast + +1. Pick highest-priority untested hypothesis +2. Write protocol (what change, what prediction, why) +3. **Lock it**: Commit to git BEFORE running +4. Run experiment (invoke domain skill) +5. Sanity check results (converged? baseline correct?) +6. Measure proxy metric +7. Record in `experiments/{hypothesis-slug}/` +8. Update `research-state.yaml` +9. If stuck → search literature or brainstorm + +### Outer Loop: Step Back and Synthesize + +1. Review all results since last reflection +2. Cluster by type: what worked? what didn't? +3. Ask WHY — identify mechanisms +4. Update `findings.md` with current understanding +5. Search literature if results surprise you +6. Generate new hypotheses if warranted +7. 
Decide direction: DEEPEN / BROADEN / PIVOT / CONCLUDE + +## Workspace Structure + +``` +{project}/ +├── research-state.yaml # Central state tracking +├── research-log.md # Decision timeline +├── findings.md # Evolving narrative synthesis +├── literature/ # Papers, survey notes +├── src/ # Reusable code (utils, plotting) +├── data/ # Raw result data +├── experiments/ # Per-hypothesis work +│ └── {hypothesis-slug}/ +│ ├── protocol.md # What, why, and prediction +│ ├── code/ # Experiment-specific code +│ ├── results/ # Raw outputs, metrics +│ └── analysis.md # What we learned +├── to_human/ # Progress presentations +└── paper/ # Final paper (optional) +``` + +## Research Discipline + +### Lock Before You Run + +Always commit your protocol to git BEFORE executing: + +```bash +git add experiments/h001-cosine-warmup/protocol.md +git commit -m "research(protocol): H001 — cosine warmup improves convergence" +# THEN run the experiment +``` + +This creates temporal proof that your plan existed before the results did. + +### Confirmatory vs Exploratory + +| Type | Definition | Trust Level | +|------|------------|-------------| +| **Confirmatory** | Matches your locked protocol | High | +| **Exploratory** | Discovered during execution | Medium — needs replication | + +### Negative Results Are Progress + +A refuted hypothesis tells you something. Log what it rules out and what it suggests.
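A refuted hypothesis can be recorded directly in `research-state.yaml` so the ruled-out search space stays visible to later loops. A minimal sketch using the fields from `templates/research-state.yaml` (the H002 values here are illustrative, not from a real run):

```yaml
hypotheses:
  H002:
    description: "Lower rank (r=4) converges faster on small datasets"
    status: "refuted"
    prediction: "r=4 reaches 90% of peak accuracy in fewer steps than r=8"
    priority: 2
    created_at: "2026-04-11"
    completed_at: "2026-04-11"
    result_summary: >
      Refuted: r=4 needed noticeably more steps than the r=8 baseline.
      Rules out "smaller adapters converge faster"; suggests testing higher ranks.
    experiment_slug: "h002-low-rank"
```

The `result_summary` states both what was ruled out and what to try next, which is exactly what an outer-loop reflection needs.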
+ +## Commands + +| Command | Description | +|---------|-------------| +| `/autoresearch <question>` | Initialize and start a research project | +| `/research-status` | Show current state and progress | +| `/research-pause` | Pause autonomous loops | +| `/research-resume` | Resume autonomous loops | +| `/research-report` | Generate progress presentation | +| `/research-conclude` | Finalize and write paper | + +## Configuration + +Add to `~/.hermes/config.yaml`: + +```yaml +autoresearch: + loop_interval_minutes: 20 # How often to check progress + max_iterations: 10 # Experiments before forced reflection + auto_commit: true # Auto-commit milestones + default_workspace: "./research" # Where to create projects +``` + +## Integration with Other Skills + +| Research Phase | Skills to Invoke | +|----------------|------------------| +| Literature search | `arxiv`, `web-search`, `notebooklm` | +| Data preparation | `data-science` tools | +| Model training | `mlops`, domain-specific skills | +| Evaluation | `evaluating-llms-harness`, custom evals | +| Paper writing | `research-paper-writing` | +| Progress reports | Built-in report generation | + +## Example: LoRA Rank Study + +``` +User: /autoresearch "Does LoRA rank affect convergence speed on small datasets?" + +Agent: +1. Bootstraps: Searches arxiv for LoRA papers +2. Forms hypotheses: H1 (rank 16), H2 (rank 4), H3 (rank 8) +3. Inner loop: Trains 3 models, records convergence steps +4. Outer loop: Notices higher rank converges faster +5. Deepens: Tests rank 32 and 64 to find the saturation point +6. Concludes: Generates report with trajectory plot +``` + +## Best Practices + +1. **Start simple**: First experiment should run in <30 minutes +2. **Define metrics upfront**: Lock evaluation criteria before running +3. **Return to literature**: When stuck or surprised, search papers +4. **Commit frequently**: Git history is your research log +5. **Show your work**: Generate progress reports for human review +6. 
**Never idle**: If blocked, diagnose, fix, or pivot — but keep moving + +## References + +- Inspired by Andrej Karpathy's autoresearch methodology +- Compatible with agentskills.io open standard +- Built-in templates from `templates/` directory + +## See Also + +- `templates/research-state.yaml` — State tracking template +- `templates/findings.md` — Synthesis template +- `templates/research-log.md` — Decision log template +- `examples/` — Example research projects diff --git a/skills/research/autoresearch/templates/findings.md b/skills/research/autoresearch/templates/findings.md new file mode 100644 index 000000000..9f2cae034 --- /dev/null +++ b/skills/research/autoresearch/templates/findings.md @@ -0,0 +1,93 @@ +# Findings: {{PROJECT_NAME}} + +**Research Question:** {{RESEARCH_QUESTION}} + +**Last Updated:** {{LAST_UPDATED}} + +--- + +## Current Understanding + +### What We Know So Far + +[Summarize the current state of knowledge. 2-4 paragraphs.] + +### Key Patterns and Insights + +| Pattern | Evidence | Confidence | +|---------|----------|------------| +| [Pattern 1] | [Which experiments support this] | High/Medium/Low | +| [Pattern 2] | [Which experiments support this] | High/Medium/Low | + +### Mechanistic Understanding + +[If applicable: What mechanisms explain the results?] + +--- + +## Lessons and Constraints + +### What Works + +- [Specific finding with context] +- [Another finding] + +### What Doesn't Work + +- [Failed approach and why] +- [Constraint discovered] + +### Critical Parameters + +| Parameter | Sweet Spot | Why | +|-----------|------------|-----| +| [Param 1] | [Value/range] | [Explanation] | + +--- + +## Open Questions + +### High Priority + +1. [Question that would change the story if answered] +2. [Another critical question] + +### Medium Priority + +1. [Nice to know but not blocking] + +### Answered + +1. ~~[Question]~~ → Answer: [Brief answer with evidence] + +--- + +## Narrative Arc + +[For paper writing: What's the story? 
What would the abstract say?] + +### Contribution Sketch + +[1-2 sentences on what this research contributes] + +### Implications + +[Who cares? Why does this matter?] + +--- + +## Next Steps + +### Immediate (Next 1-3 Experiments) + +- [ ] [Specific experiment] +- [ ] [Another experiment] + +### Medium Term + +- [ ] [Broader direction] + +### If Current Direction Fails + +- [Pivot option 1] +- [Pivot option 2] diff --git a/skills/research/autoresearch/templates/research-log.md b/skills/research/autoresearch/templates/research-log.md new file mode 100644 index 000000000..ad6faae65 --- /dev/null +++ b/skills/research/autoresearch/templates/research-log.md @@ -0,0 +1,90 @@ +# Research Log: {{PROJECT_NAME}} + +**Research Question:** {{RESEARCH_QUESTION}} + +--- + +## Bootstrap Phase + +### {{DATE}} — Project Initialization + +- **Action:** Created workspace, initialized state files +- **Research Question:** {{RESEARCH_QUESTION}} +- **Initial Thoughts:** [What makes this interesting?] + +### {{DATE}} — Literature Search + +- **Sources:** arxiv, semantic scholar, web search +- **Key Papers:** + - [Paper 1] — [Key finding relevant to question] + - [Paper 2] — [Key finding] +- **Gap Identified:** [What's missing in existing work?] + +### {{DATE}} — Hypothesis Formation + +- **H001:** [Description] → Prediction: [Specific prediction] +- **H002:** [Description] → Prediction: [Specific prediction] +- **H003:** [Description] → Prediction: [Specific prediction] + +--- + +## Inner Loop Log + +### {{DATE}} — Experiment H001 + +- **Hypothesis:** H001 +- **Protocol:** [What was changed, what was predicted] +- **Git Commit:** `research(protocol): H001 — [description]` +- **Status:** [Running/Completed/Failed] +- **Results:** + - Metric: [Value] + - Baseline: [Value] + - Delta: [+/-X] +- **Interpretation:** [What this means] +- **Next Action:** [Continue/Adjust/Pivot] + +### {{DATE}} — Experiment H002 + +[Same format...] 
+ +--- + +## Outer Loop Log + +### {{DATE}} — Reflection #1 (After {{N}} experiments) + +- **Experiments Reviewed:** H001-H00N +- **Patterns Observed:** + - [Pattern 1] + - [Pattern 2] +- **Updated Understanding:** [New insights] +- **Direction Decision:** [DEEPEN/BROADEN/PIVOT/CONCLUDE] +- **Rationale:** [Why this direction?] +- **New Hypotheses:** + - H00N+1: [Description] + - H00N+2: [Description] + +--- + +## Direction Changes + +### {{DATE}} — PIVOT: [New Direction] + +- **From:** [Old direction/assumption] +- **To:** [New direction] +- **Trigger:** [What result/surprise caused this?] +- **New Research Question:** [If changed] + +--- + +## Conclusion + +### {{DATE}} — Research Concluded + +- **Final Status:** [Completed/Partial/Abandoned] +- **Key Findings:** + 1. [Finding 1] + 2. [Finding 2] +- **Contribution:** [What this adds to the field] +- **Limitations:** [What we didn't test/couldn't conclude] +- **Future Work:** [What someone should do next] diff --git a/skills/research/autoresearch/templates/research-state.yaml b/skills/research/autoresearch/templates/research-state.yaml new file mode 100644 index 000000000..9a7d30a44 --- /dev/null +++ b/skills/research/autoresearch/templates/research-state.yaml @@ -0,0 +1,70 @@ +# Research State +# Central tracking file for autoresearch project +# Updated automatically by the agent — do not edit manually + +project: + name: "{{PROJECT_NAME}}" + created_at: "{{CREATED_AT}}" + research_question: "{{RESEARCH_QUESTION}}" + status: "bootstrapping" # bootstrapping | active | paused | concluding | completed + +bootstrap: + literature_searched: false + initial_hypotheses_formed: false + evaluation_metric_defined: false + baseline_established: false + +loops: + inner_loop_count: 0 + outer_loop_count: 0 + last_inner_loop_at: null + last_outer_loop_at: null + +direction: + current: "explore" # explore | deepen | broaden | pivot | conclude + rationale: "Initial exploration phase" + next_milestone: "Complete first 3 experiments" 
+ +hypotheses: + # Example structure — replace with actual hypotheses + H001: + description: "{{HYPOTHESIS_DESCRIPTION}}" + status: "untested" # untested | running | completed | refuted | supported + prediction: "{{PREDICTION}}" + priority: 1 + created_at: "{{CREATED_AT}}" + completed_at: null + result_summary: null + experiment_slug: null + +metrics: + primary: "{{PRIMARY_METRIC}}" # e.g., "val_loss", "accuracy", "convergence_steps" + baseline_value: null + target_value: null + current_best: null + optimization_direction: "minimize" # minimize | maximize + +trajectory: + # Auto-populated from experiments + # - experiment_id: run_001 + # hypothesis: H001 + # metric_value: 0.847 + # baseline: 0.812 + # delta: "+0.035" + # wall_time_min: 23 + # change_summary: "Added cosine annealing" + +resources: + literature_papers: [] + related_work_notes: null + code_references: [] + +continuity: + last_session_at: "{{CREATED_AT}}" + next_scheduled_loop: null + current_experiment: null + pending_tasks: [] + +notes: | + # Agent notes — context for next session + # What was I doing? What's next? 
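The `loops` counters in the state template above are what gate the switch from inner-loop experimentation to outer-loop reflection. A minimal sketch of that check, assuming a helper script (not shipped with the skill) that reads the counters with standard tools:

```shell
#!/bin/sh
# Illustrative only: decide the next phase from research-state.yaml counters.

# A tiny stand-in state file for the sketch.
cat > research-state.yaml <<'EOF'
loops:
  inner_loop_count: 10
  outer_loop_count: 0
EOF

max_iterations=10  # mirrors the autoresearch.max_iterations config default

# Pull the counter out with grep/awk (a real agent might use a YAML parser).
inner=$(grep 'inner_loop_count:' research-state.yaml | awk '{print $2}')

if [ "$inner" -ge "$max_iterations" ]; then
  phase="outer"  # forced reflection: synthesize, update findings.md, pick a direction
else
  phase="inner"  # keep running experiments
fi

echo "next phase: $phase"
```

With the counters shown this prints `next phase: outer`, i.e. the agent has hit `max_iterations` and must reflect before experimenting further.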
From c77175b7f76d726e2380cded8ac425049bd28fa1 Mon Sep 17 00:00:00 2001 From: Howard Li Date: Sat, 11 Apr 2026 12:28:50 -0700 Subject: [PATCH 2/2] docs(autoresearch): add example research project and README Add LoRA rank convergence study example demonstrating: - Bootstrap phase with literature search - Hypothesis formation and testing - Inner/outer loop workflow - Progress tracking and findings synthesis --- .../research/autoresearch/examples/README.md | 34 +++++++++ .../autoresearch/examples/lora-rank-study.md | 74 +++++++++++++++++++ 2 files changed, 108 insertions(+) create mode 100644 skills/research/autoresearch/examples/README.md create mode 100644 skills/research/autoresearch/examples/lora-rank-study.md diff --git a/skills/research/autoresearch/examples/README.md b/skills/research/autoresearch/examples/README.md new file mode 100644 index 000000000..c61a5ba9e --- /dev/null +++ b/skills/research/autoresearch/examples/README.md @@ -0,0 +1,34 @@ +# Autoresearch Examples + +This directory contains example research projects using the autoresearch methodology. + +## Available Examples + +### `lora-rank-study.md` + +**Question:** Does LoRA rank affect convergence speed on small datasets? + +**Type:** Benchmark optimization, hyperparameter study + +**Skills Used:** +- `arxiv` — Literature search +- `mlops` — Model training +- `tensorboard` — Experiment tracking + +**Key Takeaway:** Higher rank improves convergence speed up to a point (r=16), then diminishing returns. + +--- + +## Creating Your Own Research + +1. Start with `/autoresearch "your question"` +2. Follow the two-loop architecture +3. Commit protocols before running +4. 
Generate progress reports with `/research-report` + +## Tips from Examples + +- **Start small:** First experiment should complete in <30 minutes +- **Define metrics upfront:** Know what you're measuring before you start +- **Document surprises:** Negative results are progress too +- **Show your work:** Progress reports help humans follow along diff --git a/skills/research/autoresearch/examples/lora-rank-study.md b/skills/research/autoresearch/examples/lora-rank-study.md new file mode 100644 index 000000000..8ad3ddfd8 --- /dev/null +++ b/skills/research/autoresearch/examples/lora-rank-study.md @@ -0,0 +1,74 @@ +# LoRA Rank Convergence Study + +**Research Question:** Does LoRA rank affect convergence speed on small datasets? + +## Bootstrap + +### Literature + +Key papers: +- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models +- Valipour et al. (2023) — DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation + +Gap: Most papers focus on final performance, not convergence dynamics. 
+ +### Hypotheses + +- **H1:** Higher rank (r=16) converges faster but may overfit on small data +- **H2:** Lower rank (r=4) converges slower but generalizes better +- **H3:** There's an optimal rank (r=8) that balances speed and generalization + +## Experiments + +### H001 — Baseline (r=8) + +```bash +# Protocol: Train with rank 8, measure convergence steps to 90% of max accuracy +# Prediction: Baseline behavior, ~50 steps to converge +``` + +**Results:** +- Convergence steps: 47 +- Final accuracy: 0.892 +- Wall time: 12 min + +### H002 — Low Rank (r=4) + +**Results:** +- Convergence steps: 68 (+45% vs baseline) +- Final accuracy: 0.887 (-0.6%) + +### H003 — High Rank (r=16) + +**Results:** +- Convergence steps: 41 (-13% vs baseline) +- Final accuracy: 0.894 (+0.2%) + +## Outer Loop #1 + +**Pattern:** Higher rank → faster convergence, minimal overfit on this dataset + +**Decision:** DEEPEN — Test r=32 and r=64 to find saturation point + +### H004 — Very High Rank (r=32) + +**Results:** +- Convergence steps: 38 (-7% vs r=16) +- Final accuracy: 0.891 (-0.3% vs r=16) +- **Diminishing returns observed** + +### H005 — Optimal Search (r=6, r=10, r=12) + +[Running...] + +## Current Findings + +1. Convergence speed improves with rank up to r=16, then plateaus +2. Final accuracy is relatively stable across ranks (±0.5%) +3. For small datasets, r=8-12 appears optimal (speed vs compute tradeoff) + +## Next Steps + +- Complete H005-H007 +- Test on different dataset sizes (generalization) +- Write up findings
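The study's convergence metric ("steps to 90% of max accuracy", per the H001 protocol) can be computed from a plain step/accuracy log. A sketch with invented numbers (the log format and the awk helper are illustrative, not artifacts of the study):

```shell
#!/bin/sh
# Illustrative helper: first training step whose accuracy reaches 90% of the best.

cat > metrics.log <<'EOF'
10 0.40
20 0.65
30 0.78
40 0.85
50 0.89
EOF

result=$(awk '
  { step[NR] = $1; acc[NR] = $2; if ($2 > best) best = $2 }
  END {
    target = 0.9 * best
    for (i = 1; i <= NR; i++)
      if (acc[i] >= target) { print "converged at step " step[i]; exit }
  }
' metrics.log)

echo "$result"
```

With the toy log above, the best accuracy is 0.89, the 90% target is 0.801, and the first step at or above it is step 40, so this prints `converged at step 40`.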