This commit is contained in:
Oysterguard 2026-04-24 19:26:33 -05:00 committed by GitHub
commit bb11bc76a0
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
7 changed files with 590 additions and 1 deletion


@ -1,3 +1,31 @@
---
description: Skills for academic research, paper discovery, literature review, domain reconnaissance, market data, content monitoring, and scientific knowledge retrieval. Includes autoresearch for autonomous, continuous research with iterative experimentation.
---
## Skill Overview
| Skill | Purpose | Best For |
|-------|---------|----------|
| `autoresearch` | Autonomous research orchestration | Continuous experimentation, hypothesis testing, benchmark optimization |
| `arxiv` | Search academic papers | Literature surveys, paper discovery |
| `research-paper-writing` | Write publication-ready papers | Final paper generation |
| `blogwatcher` | Monitor research blogs | Staying current with new developments |
| `llm-wiki` | LLM knowledge base | Quick reference on models and techniques |
| `polymarket` | Prediction market data | Research on market trends and predictions |
## Getting Started with Research
**Quick literature search:**
```
/arxiv "transformer attention mechanisms"
```
**Start autonomous research:**
```
/autoresearch "Does LoRA rank affect convergence speed?"
```
**Write a paper:**
```
/research-paper-writing
```


@ -0,0 +1,200 @@
---
name: autoresearch
description: Autonomous research orchestration for AI coding agents. Run continuous, self-directed research with a two-loop architecture — rapid inner-loop experiments and periodic outer-loop synthesis. Ideal for literature surveys, hypothesis testing, benchmark optimization, and iterative discovery. No human hand-holding required.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [Research, Autonomous, Experiments, ML, AI, Literature, Hypothesis, Benchmark, Optimization]
    related_skills: [arxiv, research-paper-writing, web-search, notebooklm]
config:
  autoresearch.loop_interval_minutes:
    description: "Interval between autonomous research loops (in minutes)"
    default: 20
  autoresearch.max_iterations:
    description: "Maximum number of inner-loop experiments before forced reflection"
    default: 10
  autoresearch.auto_commit:
    description: "Automatically git-commit research milestones"
    default: true
---
# Autoresearch
**Autonomous research orchestration for AI coding agents.**
Run continuous, self-directed research with a two-loop architecture:
- **Inner Loop**: Rapid experiment iteration with clear measurable outcomes
- **Outer Loop**: Periodic synthesis, pattern discovery, and direction setting
Ideal for literature surveys, hypothesis testing, benchmark optimization, mechanistic interpretability studies, and any research requiring iterative experimentation.
## When to Use
| Scenario | Use Autoresearch? |
|----------|-------------------|
| "I want to explore X and see what works" | ✅ Yes |
| "Does technique Y improve metric Z?" | ✅ Yes |
| "What's the state of the art for problem W?" | ✅ Yes (bootstrap + literature) |
| "Train a model with specific hyperparameters" | ❌ Use domain skills directly |
| "Run a single evaluation" | ❌ Use evaluation skills directly |
## Quick Start
```bash
# Start a research project
/autoresearch "Does LoRA rank affect convergence speed on small datasets?"
# Or with the research tool
research_init(project="lora-rank-study", question="Does LoRA rank affect convergence speed?")
```
## The Two-Loop Architecture
```
BOOTSTRAP (once)
INNER LOOP (fast, repeating) → Run experiments → Measure → Record → Learn
↓ (every N experiments or when stuck)
OUTER LOOP (reflective) → Synthesize → New hypotheses → Decide direction
CONCLUDE → Write findings → Generate report
```
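The interleaving of the two loops can be sketched as a plain shell control flow. This is illustrative only: the counts (`N`, `total`) and the echoed messages are placeholders, not the skill's real implementation.

```shell
# Control-flow sketch only: N, total, and the messages are illustrative.
N=3          # inner experiments per outer-loop reflection (assumed)
total=6      # total inner-loop experiments in this sketch
reflections=0

i=1
while [ "$i" -le "$total" ]; do
  echo "inner: experiment $i (run, measure, record)"
  # Every N experiments, step back for an outer-loop reflection
  if [ $((i % N)) -eq 0 ]; then
    echo "outer: reflection after $i experiments (synthesize, set direction)"
    reflections=$((reflections + 1))
  fi
  i=$((i + 1))
done
echo "conclude: write findings, generate report"
```

With these numbers the sketch performs two reflections before concluding; in practice the outer loop also fires whenever the inner loop gets stuck.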
### Inner Loop: Experiment Fast
1. Pick highest-priority untested hypothesis
2. Write protocol (what change, what prediction, why)
3. **Lock it**: Commit to git BEFORE running
4. Run experiment (invoke domain skill)
5. Sanity check results (converged? baseline correct?)
6. Measure proxy metric
7. Record in `experiments/{hypothesis-slug}/`
8. Update `research-state.yaml`
9. If stuck → search literature or brainstorm
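A minimal sketch of one inner-loop iteration as shell commands. The hypothesis slug, protocol text, and recorded metric are all hypothetical, and the git step is shown commented out because it assumes an initialized repository.

```shell
# Hypothetical hypothesis slug and workspace paths
slug="H001-cosine-warmup"
mkdir -p "experiments/$slug/code" "experiments/$slug/results"

# Steps 1-2: write the protocol (what change, what prediction, why)
cat > "experiments/$slug/protocol.md" <<'EOF'
## Change
Switch the LR schedule to cosine warmup.
## Prediction
Converges in <= 45 steps (baseline: 47).
## Rationale
Warmup should stabilize early updates.
EOF

# Step 3: lock it, committing BEFORE running (requires a git repo)
# git add "experiments/$slug/protocol.md"
# git commit -m "research(protocol): $slug"

# Steps 4-7: run the experiment (placeholder), then record the result
echo "convergence_steps: 44" > "experiments/$slug/results/metrics.yaml"
```

After this, step 8 updates `research-state.yaml` with the new result before picking the next hypothesis.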
### Outer Loop: Step Back and Synthesize
1. Review all results since last reflection
2. Cluster by type: what worked? what didn't?
3. Ask WHY — identify mechanisms
4. Update `findings.md` with current understanding
5. Search literature if results surprise you
6. Generate new hypotheses if warranted
7. Decide direction: DEEPEN / BROADEN / PIVOT / CONCLUDE
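The outcome of a reflection lands in the decision log. A sketch of one such entry, where the date, pattern, and DEEPEN decision are illustrative:

```shell
# Append an outer-loop entry to the decision timeline (content is illustrative)
cat >> research-log.md <<'EOF'
### 2026-04-25 -- Reflection #1 (After 3 experiments)
- Experiments Reviewed: H001-H003
- Pattern Observed: higher rank -> faster convergence
- Direction Decision: DEEPEN
- Rationale: effect is consistent; find the saturation point
EOF
```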
## Workspace Structure
```
{project}/
├── research-state.yaml # Central state tracking
├── research-log.md # Decision timeline
├── findings.md # Evolving narrative synthesis
├── literature/ # Papers, survey notes
├── src/ # Reusable code (utils, plotting)
├── data/ # Raw result data
├── experiments/ # Per-hypothesis work
│ └── {hypothesis-slug}/
│ ├── protocol.md # What, why, and prediction
│ ├── code/ # Experiment-specific code
│ ├── results/ # Raw outputs, metrics
│ └── analysis.md # What we learned
├── to_human/ # Progress presentations
└── paper/ # Final paper (optional)
```
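The layout above can be scaffolded with a few commands; the project name here is a placeholder, and per-hypothesis directories under `experiments/` are created later as hypotheses are formed:

```shell
proj="my-research"   # placeholder project name
# Create the directory skeleton
mkdir -p "$proj/literature" "$proj/src" "$proj/data" \
         "$proj/experiments" "$proj/to_human" "$proj/paper"
# Seed the three tracking files
touch "$proj/research-state.yaml" "$proj/research-log.md" "$proj/findings.md"
```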
## Research Discipline
### Lock Before You Run
Always commit your protocol to git BEFORE executing:
```bash
git add experiments/H001-protocol.md
git commit -m "research(protocol): H001 — cosine warmup improves convergence"
# THEN run the experiment
```
This creates temporal proof your plan existed before results.
### Confirmatory vs Exploratory
| Type | Definition | Trust Level |
|------|------------|-------------|
| **Confirmatory** | Matches your locked protocol | High |
| **Exploratory** | Discovered during execution | Medium — needs replication |
### Negative Results Are Progress
A refuted hypothesis tells you something. Log what it rules out and what it suggests.
## Commands
| Command | Description |
|---------|-------------|
| `/autoresearch <question>` | Initialize and start research project |
| `/research-status` | Show current state and progress |
| `/research-pause` | Pause autonomous loops |
| `/research-resume` | Resume autonomous loops |
| `/research-report` | Generate progress presentation |
| `/research-conclude` | Finalize and write paper |
## Configuration
Add to `~/.hermes/config.yaml`:
```yaml
autoresearch:
  loop_interval_minutes: 20        # How often to check progress
  max_iterations: 10               # Experiments before forced reflection
  auto_commit: true                # Auto-commit milestones
  default_workspace: "./research"  # Where to create projects
```
## Integration with Other Skills
| Research Phase | Skills to Invoke |
|----------------|------------------|
| Literature search | `arxiv`, `web-search`, `notebooklm` |
| Data preparation | `data-science` tools |
| Model training | `mlops`, domain-specific skills |
| Evaluation | `evaluating-llms-harness`, custom evals |
| Paper writing | `research-paper-writing` |
| Progress reports | Built-in report generation |
## Example: LoRA Rank Study
```
User: /autoresearch "Does LoRA rank affect convergence speed on small datasets?"
Agent:
1. Bootstraps: Searches arxiv for LoRA papers
2. Forms hypotheses: H1 (rank 4), H2 (rank 8), H3 (rank 16)
3. Inner loop: Trains 3 models, records convergence steps
4. Outer loop: Notices rank 8 converges fastest
5. Deepens: Tests rank 6, 10, 12
6. Concludes: Generates report with trajectory plot
```
## Best Practices
1. **Start simple**: First experiment should run in <30 minutes
2. **Define metrics upfront**: Lock evaluation criteria before running
3. **Return to literature**: When stuck or surprised, search papers
4. **Commit frequently**: Git history is your research log
5. **Show your work**: Generate progress reports for human review
6. **Never idle**: If blocked, diagnose, fix, or pivot — but keep moving
## References
- Inspired by Andrej Karpathy's autoresearch methodology
- Compatible with agentskills.io open standard
- Built-in templates from `templates/` directory
## See Also
- `templates/research-state.yaml` — State tracking template
- `templates/findings.md` — Synthesis template
- `templates/research-log.md` — Decision log template
- `examples/` — Example research projects


@ -0,0 +1,34 @@
# Autoresearch Examples
This directory contains example research projects using the autoresearch methodology.
## Available Examples
### `lora-rank-study.md`
**Question:** Does LoRA rank affect convergence speed on small datasets?
**Type:** Benchmark optimization, hyperparameter study
**Skills Used:**
- `arxiv` — Literature search
- `mlops` — Model training
- `tensorboard` — Experiment tracking
**Key Takeaway:** Higher rank improves convergence speed up to a point (r=16), after which returns diminish.
---
## Creating Your Own Research
1. Start with `/autoresearch "your question"`
2. Follow the two-loop architecture
3. Commit protocols before running
4. Generate progress reports with `/research-report`
## Tips from Examples
- **Start small:** First experiment should complete in <30 minutes
- **Define metrics upfront:** Know what you're measuring before you start
- **Document surprises:** Negative results are progress too
- **Show your work:** Progress reports help humans follow along


@ -0,0 +1,74 @@
# LoRA Rank Convergence Study
**Research Question:** Does LoRA rank affect convergence speed on small datasets?
## Bootstrap
### Literature
Key papers:
- Hu et al. (2021) — LoRA: Low-Rank Adaptation of Large Language Models
- Valipour et al. (2023) — DyLoRA: Parameter-Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
Gap: Most papers focus on final performance, not convergence dynamics.
### Hypotheses
- **H1:** Higher rank (r=16) converges faster but may overfit on small data
- **H2:** Lower rank (r=4) converges slower but generalizes better
- **H3:** There's an optimal rank (r=8) that balances speed and generalization
## Experiments
### H001 — Baseline (r=8)
```bash
# Protocol: Train with rank 8, measure convergence steps to 90% of max accuracy
# Prediction: Baseline behavior, ~50 steps to converge
```
**Results:**
- Convergence steps: 47
- Final accuracy: 0.892
- Wall time: 12 min
### H002 — Low Rank (r=4)
**Results:**
- Convergence steps: 68 (+45% vs baseline)
- Final accuracy: 0.887 (-0.6%)
### H003 — High Rank (r=16)
**Results:**
- Convergence steps: 41 (-13% vs baseline)
- Final accuracy: 0.894 (+0.2%)
## Outer Loop #1
**Pattern:** Higher rank → faster convergence, minimal overfit on this dataset
**Decision:** DEEPEN — Test r=32 and r=64 to find saturation point
### H004 — Very High Rank (r=32)
**Results:**
- Convergence steps: 38 (-7% vs r=16)
- Final accuracy: 0.891 (-0.3%)
- **Diminishing returns observed**
### H005 — Optimal Search (r=6, r=10, r=12)
[Running...]
## Current Findings
1. Convergence speed improves with rank up to r=16, then plateaus
2. Final accuracy relatively stable across ranks (±0.5%)
3. For small datasets, r=8-12 appears optimal (speed vs compute tradeoff)
## Next Steps
- Complete H005-H007
- Test on different dataset sizes (generalization)
- Write up findings


@ -0,0 +1,93 @@
# Findings: {{PROJECT_NAME}}
**Research Question:** {{RESEARCH_QUESTION}}
**Last Updated:** {{LAST_UPDATED}}
---
## Current Understanding
### What We Know So Far
[Summarize the current state of knowledge. 2-4 paragraphs.]
### Key Patterns and Insights
| Pattern | Evidence | Confidence |
|---------|----------|------------|
| [Pattern 1] | [Which experiments support this] | High/Medium/Low |
| [Pattern 2] | [Which experiments support this] | High/Medium/Low |
### Mechanistic Understanding
[If applicable: What mechanisms explain the results?]
---
## Lessons and Constraints
### What Works
- [Specific finding with context]
- [Another finding]
### What Doesn't Work
- [Failed approach and why]
- [Constraint discovered]
### Critical Parameters
| Parameter | Sweet Spot | Why |
|-----------|------------|-----|
| [Param 1] | [Value/range] | [Explanation] |
---
## Open Questions
### High Priority
1. [Question that would change the story if answered]
2. [Another critical question]
### Medium Priority
1. [Nice to know but not blocking]
### Answered
1. ~~[Question]~~ → Answer: [Brief answer with evidence]
---
## Narrative Arc
[For paper writing: What's the story? What would the abstract say?]
### Contribution Sketch
[1-2 sentences on what this research contributes]
### Implications
[Who cares? Why does this matter?]
---
## Next Steps
### Immediate (Next 1-3 Experiments)
- [ ] [Specific experiment]
- [ ] [Another experiment]
### Medium Term
- [ ] [Broader direction]
### If Current Direction Fails
- [Pivot option 1]
- [Pivot option 2]


@ -0,0 +1,90 @@
# Research Log: {{PROJECT_NAME}}
**Research Question:** {{RESEARCH_QUESTION}}
---
## Bootstrap Phase
### {{DATE}} — Project Initialization
- **Action:** Created workspace, initialized state files
- **Research Question:** {{RESEARCH_QUESTION}}
- **Initial Thoughts:** [What makes this interesting?]
### {{DATE}} — Literature Search
- **Sources:** arxiv, semantic scholar, web search
- **Key Papers:**
- [Paper 1] — [Key finding relevant to question]
- [Paper 2] — [Key finding]
- **Gap Identified:** [What's missing in existing work?]
### {{DATE}} — Hypothesis Formation
- **H001:** [Description] → Prediction: [Specific prediction]
- **H002:** [Description] → Prediction: [Specific prediction]
- **H003:** [Description] → Prediction: [Specific prediction]
---
## Inner Loop Log
### {{DATE}} — Experiment H001
- **Hypothesis:** H001
- **Protocol:** [What was changed, what was predicted]
- **Git Commit:** `research(protocol): H001 — [description]`
- **Status:** [Running/Completed/Failed]
- **Results:**
- Metric: [Value]
- Baseline: [Value]
- Delta: [+/-X]
- **Interpretation:** [What this means]
- **Next Action:** [Continue/Adjust/Pivot]
### {{DATE}} — Experiment H002
[Same format...]
---
## Outer Loop Log
### {{DATE}} — Reflection #1 (After {{N}} experiments)
- **Experiments Reviewed:** H001-H00N
- **Patterns Observed:**
- [Pattern 1]
- [Pattern 2]
- **Updated Understanding:** [New insights]
- **Direction Decision:** [DEEPEN/BROADEN/PIVOT/CONCLUDE]
- **Rationale:** [Why this direction?]
- **New Hypotheses:**
- H00N+1: [Description]
- H00N+2: [Description]
---
## Direction Changes
### {{DATE}} — PIVOT: [New Direction]
- **From:** [Old direction/assumption]
- **To:** [New direction]
- **Trigger:** [What result/surprise caused this?]
- **New Research Question:** [If changed]
---
## Conclusion
### {{DATE}} — Research Concluded
- **Final Status:** [Completed/Partial/Abandoned]
- **Key Findings:**
1. [Finding 1]
2. [Finding 2]
- **Contribution:** [What this adds to the field]
- **Limitations:** [What we didn't test/couldn't conclude]
- **Future Work:** [What someone should do next]


@ -0,0 +1,70 @@
# Research State
# Central tracking file for autoresearch project
# Updated automatically by the agent — do not edit manually
project:
  name: "{{PROJECT_NAME}}"
  created_at: "{{CREATED_AT}}"
  research_question: "{{RESEARCH_QUESTION}}"
  status: "bootstrapping"  # bootstrapping | active | paused | concluding | completed

bootstrap:
  literature_searched: false
  initial_hypotheses_formed: false
  evaluation_metric_defined: false
  baseline_established: false

loops:
  inner_loop_count: 0
  outer_loop_count: 0
  last_inner_loop_at: null
  last_outer_loop_at: null

direction:
  current: "explore"  # explore | deepen | broaden | pivot | conclude
  rationale: "Initial exploration phase"
  next_milestone: "Complete first 3 experiments"

hypotheses:
  # Example structure — replace with actual hypotheses
  H001:
    description: "{{HYPOTHESIS_DESCRIPTION}}"
    status: "untested"  # untested | running | completed | refuted | supported
    prediction: "{{PREDICTION}}"
    priority: 1
    created_at: "{{CREATED_AT}}"
    completed_at: null
    result_summary: null
    experiment_slug: null

metrics:
  primary: "{{PRIMARY_METRIC}}"  # e.g., "val_loss", "accuracy", "convergence_steps"
  baseline_value: null
  target_value: null
  current_best: null
  optimization_direction: "minimize"  # minimize | maximize

trajectory:
  # Auto-populated from experiments
  # - experiment_id: run_001
  #   hypothesis: H001
  #   metric_value: 0.847
  #   baseline: 0.812
  #   delta: "+0.035"
  #   wall_time_min: 23
  #   change_summary: "Added cosine annealing"

resources:
  literature_papers: []
  related_work_notes: null
  code_references: []

continuity:
  last_session_at: "{{CREATED_AT}}"
  next_scheduled_loop: null
  current_experiment: null
  pending_tasks: []

notes: |
  # Agent notes — context for next session
  # What was I doing? What's next?