mirror of https://github.com/NousResearch/hermes-agent.git
synced 2026-04-25 00:51:20 +00:00

feat(research-paper-writing): fill coverage gaps and integrate patterns from AI-Scientist, GPT-Researcher

Fix duplicate step numbers (5.3, 7.3) and missing 7.5. Add coverage for human evaluation, theory/survey/benchmark/position papers, ethics/broader impact, arXiv strategy, code packaging, negative results, workshop papers, multi-author coordination, compute budgeting, and post-acceptance deliverables. Integrate ensemble reviewing with meta-reviewer and negative bias, pre-compilation validation pipeline, experiment journal with tree structure, breadth/depth literature search, context management for large projects, two-pass refinement, VLM visual review, and claim verification. New references: human-evaluation.md, paper-types.md.

This commit is contained in: parent 4e196a5428, commit 95a044a2e0
4 changed files with 1774 additions and 33 deletions
name: research-paper-writing
title: Research Paper Writing Pipeline
description: End-to-end pipeline for writing ML/AI research papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification.
version: 1.1.0
author: Orchestra Research
license: MIT
dependencies: [semanticscholar, arxiv, habanero, requests, scipy, numpy, matplotlib, SciencePlots]
Use this skill when:

- **Starting a new research paper** from an existing codebase or idea
- **Designing and running experiments** to support paper claims
- **Writing or revising** any section of a research paper
- **Preparing for submission** to a specific conference or workshop
- **Responding to reviews** with additional experiments or revisions
- **Converting** a paper between conference formats
- **Writing non-empirical papers** — theory, survey, benchmark, or position papers (see [Paper Types Beyond Empirical ML](#paper-types-beyond-empirical-ml))
- **Designing human evaluations** for NLP, HCI, or alignment research
- **Preparing post-acceptance deliverables** — posters, talks, code releases

## Core Philosophy
Update this throughout the project. It serves as the persistent state across sessions.

### Step 0.6: Estimate Compute Budget

Before running experiments, estimate total cost and time:

```
Compute Budget Checklist:
- [ ] API costs: (model price per token) × (estimated tokens per run) × (number of runs)
- [ ] GPU hours: (time per experiment) × (number of experiments) × (number of seeds)
- [ ] Human evaluation costs: (annotators) × (hours) × (hourly rate)
- [ ] Total budget ceiling and contingency (add 30-50% for reruns)
```
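The API-cost line of the checklist is easy to turn into a quick estimator. A minimal sketch; `PRICE_PER_1M_TOKENS` and its values are illustrative placeholders, not current prices, so substitute real rates for your models:

```python
# Rough API-cost estimator for the checklist above.
# Prices are ASSUMED example values per 1M tokens (input_usd, output_usd).
PRICE_PER_1M_TOKENS = {
    "small-model": (0.25, 1.25),
    "large-model": (3.00, 15.00),
}

def estimate_api_cost(model: str, runs: int, input_tokens_per_run: int,
                      output_tokens_per_run: int, contingency: float = 0.4) -> float:
    """Estimated USD cost, including a contingency factor for reruns (default 40%)."""
    in_price, out_price = PRICE_PER_1M_TOKENS[model]
    per_run = (input_tokens_per_run * in_price + output_tokens_per_run * out_price) / 1e6
    return per_run * runs * (1 + contingency)

# Example: 500 runs at ~20k input / 2k output tokens each
print(round(estimate_api_cost("large-model", 500, 20_000, 2_000), 2))
```

Rerun the estimate whenever the experiment grid changes; the contingency default matches the 30-50% rerun buffer above.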
Track actual spend as experiments run:

```python
# Simple cost tracker pattern
import json, os
from datetime import datetime

COST_LOG = "results/cost_log.jsonl"

def log_cost(experiment: str, model: str, input_tokens: int, output_tokens: int, cost_usd: float):
    os.makedirs(os.path.dirname(COST_LOG), exist_ok=True)  # ensure results/ exists
    entry = {
        "timestamp": datetime.now().isoformat(),
        "experiment": experiment,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": cost_usd,
    }
    with open(COST_LOG, "a") as f:
        f.write(json.dumps(entry) + "\n")
```

**When budget is tight**: Run pilot experiments (1-2 seeds, subset of tasks) before committing to full sweeps. Use cheaper models for debugging pipelines, then switch to target models for final runs.
### Step 0.7: Multi-Author Coordination

Most papers have 3-10 authors. Establish workflows early:

| Workflow | Tool | When to Use |
|----------|------|-------------|
| **Overleaf** | Browser-based | Multiple authors editing simultaneously, no git experience |
| **Git + LaTeX** | `git` with `.gitignore` for aux files | Technical teams, need branch-based review |
| **Overleaf + Git sync** | Overleaf premium | Best of both — live collab with version history |

**Section ownership**: Assign each section to one primary author. Others comment but don't edit directly. This prevents merge conflicts and style inconsistency.

```
Author Coordination Checklist:
- [ ] Agree on section ownership (who writes what)
- [ ] Set up shared workspace (Overleaf or git repo)
- [ ] Establish notation conventions (before anyone writes)
- [ ] Schedule internal review rounds (not just at the end)
- [ ] Designate one person for final formatting pass
- [ ] Agree on figure style (colors, fonts, sizes) before creating figures
```

**LaTeX conventions to agree on early**:
- `\method{}` macro for consistent method naming
- Citation style: `\citet{}` vs `\citep{}` usage
- Math notation: lowercase bold for vectors, uppercase bold for matrices, etc.
- British vs American spelling
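These conventions can be frozen in a shared preamble file committed before anyone writes. A sketch with illustrative macro names; only `\citet`/`\citep` (natbib) and `xspace` are standard, everything else is an example to adapt:

```latex
% macros.tex — shared preamble, agreed before anyone writes (illustrative names)
\usepackage{xspace}
\newcommand{\method}{\textsc{YourMethod}\xspace} % consistent method naming everywhere
\newcommand{\vect}[1]{\mathbf{#1}}               % lowercase bold vectors: \vect{x}
\newcommand{\mat}[1]{\mathbf{#1}}                % uppercase bold matrices: \mat{A}
% Citations: \citet{key} for "Smith et al. (2024) show ...",
%            \citep{key} for "... (Smith et al., 2024)".
```

Changing `\method` in one place then renames the method across the whole paper, which is exactly the consistency the checklist asks for.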
---

## Phase 1: Literature Review

```
claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp"
```

### Step 1.2b: Deepen the Search (Breadth-First, Then Depth)

A flat search (one round of queries) typically misses important related work. Use an iterative **breadth-then-depth** pattern inspired by deep research pipelines:
```
Iterative Literature Search:

Round 1 (Breadth): 4-6 parallel queries covering different angles
- "[method] + [domain]"
- "[problem name] state-of-the-art 2024 2025"
- "[baseline method] comparison"
- "[alternative approach] vs [your approach]"
→ Collect papers, extract key concepts and terminology

Round 2 (Depth): Generate follow-up queries from Round 1 learnings
- New terminology discovered in Round 1 papers
- Papers cited by the most relevant Round 1 results
- Contradictory findings that need investigation
→ Collect papers, identify remaining gaps

Round 3 (Targeted): Fill specific gaps
- Missing baselines identified in Rounds 1-2
- Concurrent work (last 6 months, same problem)
- Key negative results or failed approaches
→ Stop when new queries return mostly papers you've already seen
```

**When to stop**: If a round returns >80% papers already in your collection, the search is saturated. Typically 2-3 rounds suffice. For survey papers, expect 4-5 rounds.
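The saturation rule is easy to automate if you track papers by a stable identifier (arXiv id or DOI). A minimal sketch:

```python
def saturation(new_results: set, collection: set) -> float:
    """Fraction of this round's results already in the collection.
    Identify papers by a stable key, e.g. arXiv id or DOI."""
    if not new_results:
        return 1.0  # an empty round is trivially saturated
    return len(new_results & collection) / len(new_results)

# Stop searching once a round is >80% already-seen papers
round_ids = {"2403.01234", "2310.04567", "2401.11111"}
seen = {"2403.01234", "2310.04567"}
if saturation(round_ids, seen) > 0.8:
    print("search saturated — stop querying")
```

After each round, union the round's ids into `seen` before generating the next round's queries.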
**For agent-based workflows**: Delegate each round's queries in parallel via `delegate_task`. Collect results, deduplicate, then generate the next round's queries from the combined learnings.

### Step 1.3: Verify Every Citation

**NEVER generate BibTeX from memory. ALWAYS fetch programmatically.**
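One concrete way to fetch rather than hallucinate: DOI content negotiation against doi.org, which returns `application/x-bibtex` for DOIs registered with agencies that support it (Crossref, DataCite). A stdlib-only sketch; `normalize_doi` is a small helper introduced here, not part of any library:

```python
from urllib.request import Request, urlopen

def normalize_doi(doi: str) -> str:
    """Strip common DOI prefixes so the resolver URL is well-formed."""
    return doi.strip().removeprefix("https://doi.org/").removeprefix("doi:")

def bibtex_from_doi(doi: str, timeout: float = 10.0) -> str:
    """Fetch a BibTeX entry for a DOI via content negotiation on doi.org."""
    req = Request(f"https://doi.org/{normalize_doi(doi)}",
                  headers={"Accept": "application/x-bibtex"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For arXiv-only papers without a DOI, fetch the entry from the arXiv API instead; either way, the BibTeX comes from a registry, never from model memory.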
See [references/experiment-patterns.md](references/experiment-patterns.md) for complete design patterns, cron monitoring, and error recovery.

### Step 2.5: Design Human Evaluation (If Applicable)

Many NLP, HCI, and alignment papers require human evaluation as primary or complementary evidence. Design this before running automated experiments — human eval often has longer lead times (IRB approval, annotator recruitment).
**When human evaluation is needed:**
- Automated metrics don't capture what you care about (fluency, helpfulness, safety)
- Your contribution is about human-facing qualities (readability, preference, trust)
- Reviewers at NLP venues (ACL, EMNLP) expect it for generation tasks

**Key design decisions:**

| Decision | Options | Guidance |
|----------|---------|----------|
| **Annotator type** | Expert, crowdworker, end-user | Match to what your claims require |
| **Scale** | Likert (1-5), pairwise comparison, ranking | Pairwise is more reliable than Likert for LLM outputs |
| **Sample size** | Per annotator and total items | Power analysis or minimum 100 items, 3+ annotators |
| **Agreement metric** | Cohen's kappa, Krippendorff's alpha, ICC | Krippendorff's alpha for >2 annotators; report raw agreement too |
| **Platform** | Prolific, MTurk, internal team | Prolific for quality; MTurk for scale; internal for domain expertise |
**Annotation guideline checklist:**
```
- [ ] Clear task description with examples (good AND bad)
- [ ] Decision criteria for ambiguous cases
- [ ] At least 2 worked examples per category
- [ ] Attention checks / gold standard items (10-15% of total)
- [ ] Qualification task or screening round
- [ ] Estimated time per item and fair compensation (>= local minimum wage)
- [ ] IRB/ethics review if required by your institution
```
**Reporting requirements** (reviewers check all of these):
- Number of annotators and their qualifications
- Inter-annotator agreement with specific metric and value
- Compensation details (amount, estimated hourly rate)
- Annotation interface description or screenshot (appendix)
- Total annotation time
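For two annotators with nominal labels, raw agreement and Cohen's kappa are a few lines of stdlib Python. A sketch; for more than two annotators, compute Krippendorff's alpha instead (e.g. via the `krippendorff` PyPI package):

```python
def raw_agreement(a: list, b: list) -> float:
    """Fraction of items where two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators over the same items (nominal labels).
    Undefined when chance agreement is 1 (all labels identical)."""
    n = len(a)
    po = raw_agreement(a, b)  # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return (po - pe) / (1 - pe)
```

Report both numbers: kappa corrects for chance, but reviewers also want the raw percentage.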
See [references/human-evaluation.md](references/human-evaluation.md) for the complete guide, including statistical tests for human eval data, crowdsourcing quality control patterns, and IRB guidance.

---

## Phase 3: Experiment Execution & Monitoring

```bash
git commit -m "Add <experiment name>: <key finding in 1 line>"
git push
```

### Step 3.5: Maintain an Experiment Journal

Git commits track what happened, but not the **exploration tree** — the decisions about what to try next based on what you learned. Maintain a structured experiment journal that captures this tree:
```json
// experiment_journal.jsonl — append one entry per experiment attempt
{
  "id": "exp_003",
  "parent": "exp_001",
  "timestamp": "2025-05-10T14:30:00Z",
  "hypothesis": "Adding scope constraints will fix convergence failure from exp_001",
  "plan": "Re-run autoreason with max_tokens=2000 and fixed structure template",
  "config": {"model": "haiku", "strategy": "autoreason", "max_tokens": 2000},
  "status": "completed",
  "result_path": "results/exp_003/",
  "key_metrics": {"win_rate": 0.85, "convergence_rounds": 3},
  "analysis": "Scope constraints fixed convergence. Win rate jumped from 0.42 to 0.85.",
  "next_steps": ["Try same constraints on Sonnet", "Test without structure template"],
  "figures": ["figures/exp003_convergence.pdf"]
}
```
**Why a journal, not just git?** Git tracks file changes. The journal tracks the reasoning: why you tried X, what you learned, and what that implies for the next experiment. When writing the paper, this tree is invaluable for the Methods section ("we observed X, which motivated Y") and for honest failure reporting.

**Selecting the best path**: When the journal shows a branching tree (exp_001 → exp_002a, exp_002b, exp_003), identify the path that best supports the paper's claims. Document dead-end branches in the appendix as ablations or negative results.
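Because entries carry `parent` links, the journal is a tree you can query mechanically. A sketch that recovers the root-to-leaf path for a chosen experiment, assuming the JSONL format shown above:

```python
import json

def load_journal(path: str = "experiment_journal.jsonl") -> dict:
    """Load all journal entries, keyed by experiment id."""
    with open(path) as f:
        entries = [json.loads(line) for line in f if line.strip()]
    return {e["id"]: e for e in entries}

def lineage(entries: dict, leaf_id: str) -> list:
    """Walk parent links from a chosen leaf back to the root experiment."""
    path = []
    node = entries.get(leaf_id)
    while node is not None:
        path.append(node["id"])
        node = entries.get(node.get("parent"))
    return list(reversed(path))  # root → leaf, ready for the Methods narrative
```

For the example entry above, `lineage(entries, "exp_003")` yields `["exp_001", "exp_003"]`; ids outside that path are your dead-end branches for the appendix.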
**Snapshot code per experiment**: Copy the experiment script after each run:

```bash
cp experiment.py results/exp_003/experiment_snapshot.py
```

This enables exact reproduction even after subsequent code changes.

---

## Phase 4: Result Analysis

After analysis, explicitly answer:

3. **What failed?** Failed experiments can be the most informative. Honest reporting of failures strengthens the paper.
4. **What follow-up experiments are needed?** Results often raise new questions.
#### Handling Negative or Null Results

When your hypothesis was wrong or results are inconclusive, you have four options:

| Situation | Action | Venue Fit |
|-----------|--------|-----------|
| Hypothesis wrong but **why** is informative | Frame paper around the analysis of why | NeurIPS, ICML (if analysis is rigorous) |
| Method doesn't beat baselines but **reveals something new** | Reframe contribution as understanding/analysis | ICLR (values understanding), workshop papers |
| Clean negative result on popular claim | Write it up — the field needs to know | NeurIPS Datasets & Benchmarks, TMLR, workshops |
| Results inconclusive, no clear story | Pivot — run different experiments or reframe | Don't force a paper that isn't there |

**How to write a negative results paper:**
- Lead with what the community believes and why it matters to test it
- Describe your rigorous methodology (it must be airtight — reviewers will scrutinize harder)
- Present the null result clearly with statistical evidence
- Analyze **why** the expected result didn't materialize
- Discuss implications for the field

**Venues that explicitly welcome negative results**: NeurIPS (Datasets & Benchmarks track), TMLR, the ML Reproducibility Challenge, and workshops at major conferences. Some workshops specifically call for negative results.
### Step 4.4: Create Figures and Tables

**Figures**:

| Missing one ablation reviewers will ask for | Run it, then Phase 5 |
| All experiments done but some failed | Note failures, move to Phase 5 |
### Step 4.6: Write the Experiment Log (Bridge to Writeup)

Before moving to paper writing, create a structured experiment log that bridges results to prose. This is the single most important connective tissue between experiments and the writeup — without it, the writing agent has to re-derive the story from raw result files.

**Create `experiment_log.md`** with the following structure:
```markdown
# Experiment Log

## Contribution (one sentence)
[The paper's main claim]

## Experiments Run

### Experiment 1: [Name]
- **Claim tested**: [Which paper claim this supports]
- **Setup**: [Model, dataset, config, number of runs]
- **Key result**: [One sentence with the number]
- **Result files**: results/exp1/final_info.json
- **Figures generated**: figures/exp1_comparison.pdf
- **Surprising findings**: [Anything unexpected]

### Experiment 2: [Name]
...

## Figures
| Filename | Description | Which section it belongs in |
|----------|-------------|---------------------------|
| figures/main_comparison.pdf | Bar chart comparing all methods on benchmark X | Results, Figure 2 |
| figures/ablation.pdf | Ablation removing components A, B, C | Results, Figure 3 |
...

## Failed Experiments (document for honesty)
- [What was tried, why it failed, what it tells us]

## Open Questions
- [Anything the results raised that the paper should address]
```
**Why this matters**: When drafting, the agent (or a delegated sub-agent) can load `experiment_log.md` alongside the LaTeX template and produce a first draft grounded in actual results. Without this bridge, the writing agent must parse raw JSON/CSV files and infer the story — a common source of hallucinated or misreported numbers.

**Git discipline**: Commit this log alongside the results it describes.
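A cheap sanity check before handing the log to a writing agent: verify that every figure path the log mentions actually exists on disk. A sketch assuming the `figures/` naming used in the template above:

```python
import re
from pathlib import Path

def check_figure_references(log_path: str = "experiment_log.md") -> list:
    """Return (and print) figure paths mentioned in the log that don't exist."""
    text = Path(log_path).read_text()
    figures = re.findall(r"figures/[\w\-]+\.(?:pdf|png|svg)", text)
    missing = [f for f in figures if not Path(f).exists()]
    for f in missing:
        print(f"MISSING: {f}")
    return missing
```

Run it in the same git commit that snapshots the log, so a dangling figure reference is caught before drafting starts.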
---

## Iterative Refinement: Strategy Selection

See [references/autoreason-methodology.md](references/autoreason-methodology.md)

**Goal**: Write a complete, publication-ready paper.
### Context Management for Large Projects

A paper project with 50+ experiment files, multiple result directories, and extensive literature notes can easily exceed the agent's context window. Manage this proactively:

**What to load into context per drafting task:**

| Drafting Task | Load Into Context | Do NOT Load |
|---------------|------------------|-------------|
| Writing Introduction | `experiment_log.md`, contribution statement, 5-10 most relevant paper abstracts | Raw result JSONs, full experiment scripts, all literature notes |
| Writing Methods | Experiment configs, pseudocode, architecture description | Raw logs, results from other experiments |
| Writing Results | `experiment_log.md`, result summary tables, figure list | Full analysis scripts, intermediate data |
| Writing Related Work | Organized citation notes (Step 1.4 output), .bib file | Experiment files, raw PDFs |
| Revision pass | Full paper draft, specific reviewer concerns | Everything else |
**Principles:**
- **`experiment_log.md` is the primary context bridge** — it summarizes everything needed for writing without loading raw data files (see Step 4.6)
- **Load one section's context at a time** when delegating. A sub-agent drafting Methods doesn't need the literature review notes.
- **Summarize, don't include raw files.** For a 200-line result JSON, load a 10-line summary table. For a 50-page related paper, load the 5-sentence abstract + your 2-line note about its relevance.
- **For very large projects**: Create a `context/` directory with pre-compressed summaries:

```
context/
  contribution.md        # 1 sentence
  experiment_summary.md  # Key results table (from experiment_log.md)
  literature_map.md      # Organized citation notes
  figure_inventory.md    # List of figures with descriptions
```
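The `experiment_summary.md` file can be generated rather than hand-written. A sketch assuming each experiment directory holds a `final_info.json` of metrics (the filename follows the experiment-log example earlier); the output paths are the ones from the tree above:

```python
import json
from pathlib import Path

def build_experiment_summary(results_dir: str = "results",
                             out: str = "context/experiment_summary.md") -> int:
    """Condense each experiment's final_info.json into one markdown table row.
    Returns the number of experiments summarized."""
    rows = ["| Experiment | Key metrics |", "|---|---|"]
    for info in sorted(Path(results_dir).glob("*/final_info.json")):
        metrics = json.loads(info.read_text())
        pretty = ", ".join(f"{k}={v}" for k, v in metrics.items())
        rows.append(f"| {info.parent.name} | {pretty} |")
    Path(out).parent.mkdir(parents=True, exist_ok=True)
    Path(out).write_text("\n".join(rows) + "\n")
    return len(rows) - 2
```

Regenerate after every experiment batch so the drafting context never drifts from the actual results.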
### The Narrative Principle

**The single most critical insight**: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence.

```
Paper Writing Checklist:
- [ ] Step 12: Final review
```
### Two-Pass Refinement Pattern

When drafting with an AI agent, use a **two-pass** approach (proven effective in SakanaAI's AI-Scientist pipeline):

**Pass 1 — Write + immediate refine per section:**
For each section, write a complete draft, then immediately refine it in the same context. This catches local issues (clarity, flow, completeness) while the section is fresh.

**Pass 2 — Global refinement with full-paper context:**
After all sections are drafted, revisit each section with awareness of the complete paper. This catches cross-section issues: redundancy, inconsistent terminology, narrative flow, and gaps where one section promises something another doesn't deliver.

```
Second-pass refinement prompt (per section):

"Review the [SECTION] in the context of the complete paper.
- Does it fit with the rest of the paper? Are there redundancies with other sections?
- Is terminology consistent with Introduction and Methods?
- Can anything be cut without weakening the message?
- Does the narrative flow from the previous section and into the next?
Make minimal, targeted edits. Do not rewrite from scratch."
```
### LaTeX Error Checklist

Append this checklist to every refinement prompt. These are the most common errors when LLMs write LaTeX:

```
LaTeX Quality Checklist (verify after every edit):
- [ ] No unenclosed math symbols ($ signs balanced)
- [ ] Only reference figures/tables that exist (\ref matches \label)
- [ ] No fabricated citations (\cite matches entries in .bib)
- [ ] Every \begin{env} has matching \end{env} (especially figure, table, algorithm)
- [ ] No HTML contamination (</end{figure}> instead of \end{figure})
- [ ] No unescaped underscores outside math mode (use \_ in text)
- [ ] No duplicate \label definitions
- [ ] No duplicate section headers
- [ ] Numbers in text match actual experimental results
- [ ] All figures have captions and labels
- [ ] No overly long lines that cause overfull hbox warnings
```
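Several of these checks are mechanical and can run as a pre-compilation validation step. A regex-based sketch, deliberately crude: it is a fast first pass before `latexmk`, not a LaTeX parser, and it ignores edge cases like `\$` and optional `\cite` arguments:

```python
import re

def validate_latex(tex: str, bib_keys: set = frozenset()) -> list:
    """Cheap pre-compilation checks for the checklist above. Returns issue strings."""
    issues = []
    if tex.count("$") % 2:  # crude: ignores \$ and $$ nuances
        issues.append("unbalanced $ signs")
    labels = re.findall(r"\\label\{([^}]+)\}", tex)
    for ref in set(re.findall(r"\\(?:ref|eqref|autoref)\{([^}]+)\}", tex)):
        if ref not in labels:
            issues.append(f"\\ref{{{ref}}} has no matching \\label")
    dups = {l for l in labels if labels.count(l) > 1}
    if dups:
        issues.append(f"duplicate \\label: {sorted(dups)}")
    for env in set(re.findall(r"\\begin\{(\w+\*?)\}", tex)):
        if tex.count(f"\\begin{{{env}}}") != tex.count(f"\\end{{{env}}}"):
            issues.append(f"unbalanced environment: {env}")
    for group in set(re.findall(r"\\cite[tp]?\{([^}]+)\}", tex)):
        for key in group.split(","):
            if bib_keys and key.strip() not in bib_keys:
                issues.append(f"\\cite key not in .bib: {key.strip()}")
    return issues
```

Run it on every save; an empty list is the precondition for invoking the compiler, and any non-empty result goes back into the refinement prompt verbatim.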
### Step 5.0: Title

The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract.

Must include:
- 2-4 bullet contribution list (max 1-2 lines each in two-column format)
- Methods should start by page 2-3

### Step 5.4: Methods

Enable reimplementation:
- Conceptual outline or pseudocode
- Architectural details sufficient for reproduction
- Present final design decisions; ablations go in experiments

### Step 5.5: Experiments & Results

For each experiment, explicitly state:
- **What claim it supports**
Requirements:
- Compute infrastructure (GPU type, total hours)
- Seed-setting methods

### Step 5.6: Related Work

Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.

### Step 5.7: Limitations (REQUIRED)

All major conferences require this. Honesty helps:
- Reviewers are instructed not to penalize honest limitation acknowledgment
- Pre-empt criticisms by identifying weaknesses first
- Explain why limitations don't undermine core claims

### Step 5.8: Conclusion & Discussion

**Conclusion** (required, 0.5-1 page):
- Restate the contribution in one sentence (different wording from abstract)

**Do NOT** introduce new results or claims in the conclusion.

### Step 5.9: Appendix Strategy

Appendices are unlimited at all major venues and are essential for reproducibility. Structure:
When over the page limit:

**Do NOT**: reduce font size, change margins, remove required sections (limitations, broader impact), or use `\small`/`\footnotesize` for main text.

### Step 5.10: Ethics & Broader Impact Statement

Most venues now require or strongly encourage an ethics/broader impact statement. This is not boilerplate — reviewers read it and can flag ethics concerns that trigger desk rejection.

**What to include:**

| Component | Content | Required By |
|-----------|---------|-------------|
| **Positive societal impact** | How your work benefits society | NeurIPS, ICML |
| **Potential negative impact** | Misuse risks, dual-use concerns, failure modes | NeurIPS, ICML |
| **Fairness & bias** | Does your method/data have known biases? | All venues (implicitly) |
| **Environmental impact** | Compute carbon footprint for large-scale training | ICML, increasingly NeurIPS |
| **Privacy** | Does your work use or enable processing of personal data? | ACL, NeurIPS |
| **LLM disclosure** | Was AI used in writing or experiments? | ICLR (mandatory), ACL |
**Writing the statement:**

```latex
\section*{Broader Impact Statement}
% NeurIPS/ICML: after conclusion, does not count toward page limit

% 1. Positive applications (1-2 sentences)
This work enables [specific application] which may benefit [specific group].

% 2. Risks and mitigations (1-3 sentences, be specific)
[Method/model] could potentially be misused for [specific risk]. We mitigate
this by [specific mitigation, e.g., releasing only model weights above size X,
including safety filters, documenting failure modes].

% 3. Limitations of impact claims (1 sentence)
Our evaluation is limited to [specific domain]; broader deployment would
require [specific additional work].
```

**Common mistakes:**
- Writing "we foresee no negative impacts" (almost never true — reviewers distrust this)
- Being vague: "this could be misused" without specifying how
- Ignoring compute costs for large-scale work
- Forgetting to disclose LLM use at venues that require it
**Compute carbon footprint** (for training-heavy papers):

```python
# Estimate using ML CO2 Impact tool methodology
gpu_hours = 1000          # total GPU hours
gpu_tdp_watts = 400       # e.g., A100 = 400W
pue = 1.1                 # Power Usage Effectiveness (data center overhead)
carbon_intensity = 0.429  # kg CO2/kWh (US average; varies by region)

energy_kwh = (gpu_hours * gpu_tdp_watts * pue) / 1000
carbon_kg = energy_kwh * carbon_intensity
print(f"Energy: {energy_kwh:.0f} kWh, Carbon: {carbon_kg:.0f} kg CO2eq")
```
### Step 5.11: Datasheets & Model Cards (If Applicable)

If your paper introduces a **new dataset** or **releases a model**, include structured documentation. Reviewers increasingly expect this, and the NeurIPS Datasets & Benchmarks track requires it.

**Datasheets for Datasets** (Gebru et al., 2021) — include in appendix:

```
Dataset Documentation (Appendix):
- Motivation: Why was this dataset created? What task does it support?
- Composition: What are the instances? How many? What data types?
- Collection: How was data collected? What was the source?
- Preprocessing: What cleaning/filtering was applied?
- Distribution: How is the dataset distributed? Under what license?
- Maintenance: Who maintains it? How to report issues?
- Ethical considerations: Contains personal data? Consent obtained?
  Potential for harm? Known biases?
```

**Model Cards** (Mitchell et al., 2019) — include in appendix for model releases:

```
Model Card (Appendix):
- Model details: Architecture, training data, training procedure
- Intended use: Primary use cases, out-of-scope uses
- Metrics: Evaluation metrics and results on benchmarks
- Ethical considerations: Known biases, fairness evaluations
- Limitations: Known failure modes, domains where model underperforms
```
### Writing Style

**Sentence-level clarity (Gopen & Swan's 7 Principles):**

**Goal**: Simulate the review process before submission. Catch weaknesses early.
### Step 6.1: Simulate Reviews (Ensemble Pattern)

Generate reviews from multiple perspectives. The key insight from automated research pipelines (notably SakanaAI's AI-Scientist): **ensemble reviewing with a meta-reviewer produces far more calibrated feedback than a single review pass.**

**Step 1: Generate N independent reviews** (N=3-5)

Use different models or temperature settings. Each reviewer sees only the paper, not other reviews. **Default to negative bias** — LLMs have a well-documented positivity bias in evaluation.
```
You are an expert reviewer for [VENUE]. You are critical and thorough.
If a paper has weaknesses or you are unsure about a claim, flag it clearly
and reflect that in your scores. Do not give the benefit of the doubt.

Review this paper according to the official reviewer guidelines. Evaluate:

1. Soundness (are claims well-supported? are baselines fair and strong?)
2. Clarity (is the paper well-written? could an expert reproduce it?)
3. Significance (does this matter to the community?)
4. Originality (new insights, not just incremental combination?)

Provide your review as structured JSON:
{
  "summary": "2-3 sentence summary",
  "strengths": ["strength 1", "strength 2", ...],
  "weaknesses": ["weakness 1 (most critical)", "weakness 2", ...],
  "questions": ["question for authors 1", ...],
  "missing_references": ["paper that should be cited", ...],
  "soundness": 1-4,
  "presentation": 1-4,
  "contribution": 1-4,
  "overall": 1-10,
  "confidence": 1-5
}
```
**Step 2: Meta-review (Area Chair aggregation)**

Feed all N reviews to a meta-reviewer:

```
You are an Area Chair at [VENUE]. You have received [N] independent reviews
of a paper. Your job is to:

1. Identify consensus strengths and weaknesses across reviewers
2. Resolve disagreements by examining the paper directly
3. Produce a meta-review that represents the aggregate judgment
4. Use AVERAGED numerical scores across all reviews

Be conservative: if reviewers disagree on whether a weakness is serious,
treat it as serious until the authors address it.

Reviews:
[review_1]
[review_2]
...
```
**Step 3: Reflection loop** (optional, 2-3 rounds)

Each reviewer can refine their review after seeing the meta-review. Use an early termination sentinel: if the reviewer responds "I am done" (no changes), stop iterating.
**Model selection for reviewing**: Reviewing is best done with the strongest available model, even if you wrote the paper with a cheaper one. The reviewer model should be chosen independently from the writing model.

**Few-shot calibration**: If available, include 1-2 real published reviews from the target venue as examples. This dramatically improves score calibration. See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for example reviews.
|
||||
|
||||
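
For agent-based workflows, the ensemble-plus-meta-review loop can be scripted. A minimal sketch, assuming `call_model(prompt) -> str` is a placeholder for your LLM client and each reviewer prompt ends with the JSON schema above:

```python
import json
import statistics

NUMERIC_FIELDS = ["soundness", "presentation", "contribution", "overall", "confidence"]

def aggregate_scores(reviews):
    """Average each numeric field across all parsed reviews (AC-style aggregation)."""
    return {field: statistics.mean(r[field] for r in reviews) for field in NUMERIC_FIELDS}

def run_review_round(call_model, reviewer_prompts, meta_prompt):
    """One ensemble round: N independent reviews, then one meta-review."""
    reviews = [json.loads(call_model(p)) for p in reviewer_prompts]
    scores = aggregate_scores(reviews)
    # The meta-reviewer sees every individual review verbatim
    meta_review = call_model(meta_prompt + "\n\nReviews:\n" + json.dumps(reviews, indent=2))
    return reviews, scores, meta_review
```

A reflection loop simply wraps `run_review_round`, stopping early once every reviewer answers with the "I am done" sentinel.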
### Step 6.1b: Visual Review Pass (VLM)

Text-only review misses an entire class of problems: figure quality, layout issues, visual consistency. If you have access to a vision-capable model, run a separate **visual review** on the compiled PDF:

```
You are reviewing the visual presentation of this research paper PDF.
Check for:
1. Figure quality: Are plots readable? Labels legible? Colors distinguishable?
2. Figure-caption alignment: Does each caption accurately describe its figure?
3. Layout issues: Orphaned section headers, awkward page breaks, figures far from their references
4. Table formatting: Aligned columns, consistent decimal precision, bold for best results
5. Visual consistency: Same color scheme across all figures, consistent font sizes
6. Grayscale readability: Would the figures be understandable if printed in B&W?

For each issue, specify the page number and exact location.
```

This catches problems that text-based review cannot: a plot with illegible axis labels, a figure placed 3 pages from its first reference, inconsistent color palettes between Figure 2 and Figure 5, or a table that's clearly wider than the column width.

### Step 6.1c: Claim Verification Pass

After simulated reviews, run a separate verification pass. This catches factual errors that reviewers might miss:

```
Claim Verification Protocol:
1. Extract every factual claim from the paper (numbers, comparisons, trends)
2. For each claim, trace it to the specific experiment/result that supports it
3. Verify the number in the paper matches the actual result file
4. Flag any claim without a traceable source as [VERIFY]
```

For agent-based workflows: delegate verification to a **fresh sub-agent** that receives only the paper text and the raw result files. The fresh context prevents confirmation bias — the verifier doesn't "remember" what the results were supposed to be.

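
The extraction step can be partially automated. A minimal sketch, assuming results live in plain-text files; the regex and the substring match are deliberately crude, so treat flags as candidates for manual checking rather than confirmed errors:

```python
import re

def extract_numeric_claims(paper_text):
    """Pull every number out of the paper text, with a little surrounding context."""
    claims = []
    for m in re.finditer(r"\d+\.\d+%?|\d+%", paper_text):
        start = max(0, m.start() - 40)
        claims.append((m.group(), paper_text[start:m.end()]))
    return claims

def verify_claims(paper_text, result_blobs):
    """Flag numbers in the paper that appear in no raw result file."""
    combined = "\n".join(result_blobs)
    flagged = []
    for value, context in extract_numeric_claims(paper_text):
        if value.rstrip("%") not in combined:
            flagged.append((value, context))  # mark as [VERIFY] in the draft
    return flagged
```

Derived quantities (deltas, percentages computed in the text) will be flagged too, which is intentional: they are exactly the numbers most likely to drift from the underlying results.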
### Step 6.2: Prioritize Feedback

After collecting reviews, categorize:

- [ ] Required sections present (limitations, broader impact, etc.)
```

### Step 7.4: Pre-Compilation Validation

Run these automated checks **before** attempting `pdflatex`. Catching errors here is faster than debugging compiler output.

```bash
# 1. Lint with chktex (catches common LaTeX mistakes)
# Suppress noisy warnings: -n2 (sentence end), -n24 (parens), -n13 (intersentence), -n1 (command terminated)
chktex main.tex -q -n2 -n24 -n13 -n1

# 2. Verify all citations exist in .bib
# Extract \cite{...} from .tex, check each against .bib
python3 -c "
import re
tex = open('main.tex').read()
bib = open('references.bib').read()
cites = set(re.findall(r'\\\\cite[tp]?{([^}]+)}', tex))
for cite_group in cites:
    for cite in cite_group.split(','):
        cite = cite.strip()
        if cite and cite not in bib:
            print(f'WARNING: \\\\cite{{{cite}}} not found in references.bib')
"

# 3. Verify all referenced figures exist on disk
python3 -c "
import re, os
tex = open('main.tex').read()
figs = re.findall(r'\\\\includegraphics(?:\[.*?\])?{([^}]+)}', tex)
for fig in figs:
    if not os.path.exists(fig):
        print(f'WARNING: Figure file not found: {fig}')
"

# 4. Check for duplicate \label definitions
python3 -c "
import re
from collections import Counter
tex = open('main.tex').read()
labels = re.findall(r'\\\\label{([^}]+)}', tex)
dupes = {k: v for k, v in Counter(labels).items() if v > 1}
for label, count in dupes.items():
    print(f'WARNING: Duplicate label: {label} (appears {count} times)')
"
```

Fix any warnings before proceeding. For agent-based workflows: feed chktex output back to the agent with instructions to make minimal fixes.

### Step 7.5: Final Compilation

```bash
# Clean build
rm -f *.aux *.bbl *.blg *.log *.out *.pdf
latexmk -pdf main.tex

# Or manual (triple pdflatex + bibtex for cross-references)
pdflatex -interaction=nonstopmode main.tex
bibtex main
pdflatex -interaction=nonstopmode main.tex
pdflatex -interaction=nonstopmode main.tex

# Verify output exists and has content
ls -la main.pdf
```

**If compilation fails**: Parse the `.log` file for the first error. Common fixes:
- "Undefined control sequence" → missing package or typo in command name
- "Missing $ inserted" → math symbol outside math mode
- "File not found" → wrong figure path or missing .sty file
- "Citation undefined" → .bib entry missing or bibtex not run
### Step 7.6: Conference-Specific Requirements

| Venue | Special Requirements |
|-------|---------------------|
| **AAAI** | Strict style file — no modifications whatsoever |
| **COLM** | Frame contribution for language model community |

### Step 7.7: Conference Resubmission & Format Conversion

When converting between venues, **never copy LaTeX preambles between templates**:

**After rejection**: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review).
### Step 7.8: Camera-Ready Preparation (Post-Acceptance)

After acceptance, prepare the camera-ready version:

- [ ] Upload supplementary materials (code, data, appendix) to venue portal
```

### Step 7.9: arXiv & Preprint Strategy

Posting to arXiv is standard practice in ML but has important timing and anonymity considerations.

**Timing decision tree:**

| Situation | Recommendation |
|-----------|---------------|
| Submitting to double-blind venue (NeurIPS, ICML, ACL) | Post to arXiv **after** submission deadline, not before. Posting before can technically violate anonymity policies, though enforcement varies. |
| Submitting to ICLR | ICLR explicitly allows arXiv posting before submission. But don't put author names in the submission itself. |
| Paper already on arXiv, submitting to new venue | Acceptable at most venues. Do NOT update arXiv version during review with changes that reference reviews. |
| Workshop paper | arXiv is fine at any time — workshops are typically not double-blind. |
| Want to establish priority | Post immediately if scooping is a concern — but accept the anonymity tradeoff. |

**arXiv category selection** (ML/AI papers):

| Category | Code | Best For |
|----------|------|----------|
| Machine Learning | `cs.LG` | General ML methods |
| Computation and Language | `cs.CL` | NLP, language models |
| Artificial Intelligence | `cs.AI` | Reasoning, planning, agents |
| Computer Vision | `cs.CV` | Vision models |
| Information Retrieval | `cs.IR` | Search, recommendation |

**List primary + 1-2 cross-listed categories.** More categories = more visibility, but only cross-list where genuinely relevant.

**Versioning strategy:**
- **v1**: Initial submission (matches conference submission)
- **v2**: Post-acceptance with camera-ready corrections (add "accepted at [Venue]" to abstract)
- Don't post v2 during the review period with changes that clearly respond to reviewer feedback

```bash
# Check if your paper's title is already taken on arXiv
# (before choosing a title)
pip install arxiv
python -c "
import arxiv
results = list(arxiv.Search(query='ti:\"Your Exact Title\"', max_results=5).results())
print(f'Found {len(results)} matches')
for r in results: print(f'  {r.title} ({r.published.year})')
"
```

### Step 7.10: Research Code Packaging

Releasing clean, runnable code significantly increases citations and reviewer trust. Package code alongside the camera-ready submission.

**Repository structure:**

```
your-method/
  README.md              # Setup, usage, reproduction instructions
  requirements.txt       # Or environment.yml for conda
  setup.py               # For pip-installable packages
  LICENSE                # MIT or Apache 2.0 recommended for research
  configs/               # Experiment configurations
  src/                   # Core method implementation
  scripts/               # Training, evaluation, analysis scripts
    train.py
    evaluate.py
    reproduce_table1.sh  # One script per main result
  data/                  # Small data or download scripts
    download_data.sh
  results/               # Expected outputs for verification
```

**README template for research code:**

```markdown
# [Paper Title]

Official implementation of "[Paper Title]" (Venue Year).

## Setup
[Exact commands to set up environment]

## Reproduction
To reproduce Table 1: `bash scripts/reproduce_table1.sh`
To reproduce Figure 2: `python scripts/make_figure2.py`

## Citation
[BibTeX entry]
```

**Pre-release checklist:**
```
- [ ] Code runs from a clean clone (test on fresh machine or Docker)
- [ ] All dependencies pinned to specific versions
- [ ] No hardcoded absolute paths
- [ ] No API keys, credentials, or personal data in repo
- [ ] README covers setup, reproduction, and citation
- [ ] LICENSE file present (MIT or Apache 2.0 for max reuse)
- [ ] Results are reproducible within expected variance
- [ ] .gitignore excludes data files, checkpoints, logs
```
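
Two of the checklist items, hardcoded absolute paths and leaked credentials, are easy to scan for mechanically. A minimal sketch; the patterns are illustrative and should be extended for your own stack:

```python
import re
from pathlib import Path

# Patterns that usually indicate a repo is not ready for public release
SUSPECT_PATTERNS = {
    "absolute path": re.compile(r"[\"'](?:/home/|/Users/|[A-Z]:\\\\)"),
    "api key": re.compile(r"(?i)(api[_-]?key|secret|token)\s*=\s*[\"'][A-Za-z0-9_\-]{12,}[\"']"),
}

def scan_repo(root, extensions=(".py", ".sh", ".yaml", ".yml", ".json")):
    """Return (path, pattern_name, line_number) for every suspicious line."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for name, pattern in SUSPECT_PATTERNS.items():
                if pattern.search(line):
                    hits.append((str(path), name, lineno))
    return hits
```

Run it once before tagging the release commit; an empty return is a necessary (not sufficient) condition for the two checklist items.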
**Anonymous code for submission** (before acceptance):

```bash
# Use Anonymous GitHub for double-blind review
# https://anonymous.4open.science/
# Upload your repo → get an anonymous URL → put in paper
```

---

## Phase 8: Post-Acceptance Deliverables

**Goal**: Maximize the impact of your accepted paper through presentation materials and community engagement.

### Step 8.1: Conference Poster

Most conferences require a poster session. Poster design principles:

| Element | Guideline |
|---------|-----------|
| **Size** | Check venue requirements (typically 24"x36" or A0 portrait/landscape) |
| **Content** | Title, authors, 1-sentence contribution, method figure, 2-3 key results, conclusion |
| **Flow** | Top-left to bottom-right (Z-pattern) or columnar |
| **Text** | Title readable at 3m, body at 1m. No full paragraphs — bullet points only. |
| **Figures** | Reuse paper figures at higher resolution. Enlarge key result. |

**Tools**: LaTeX (`beamerposter` package), PowerPoint/Keynote, Figma, Canva.

**Production**: Order 2+ weeks before the conference. Fabric posters are lighter for travel. Many conferences now support virtual/digital posters too.

### Step 8.2: Conference Talk / Spotlight

If awarded an oral or spotlight presentation:

| Talk Type | Duration | Content |
|-----------|----------|---------|
| **Spotlight** | 5 min | Problem, approach, one key result. Rehearse to exactly 5 minutes. |
| **Oral** | 15-20 min | Full story: problem, approach, key results, ablations, limitations. |
| **Workshop talk** | 10-15 min | Adapt based on workshop audience — may need more background. |

**Slide design rules:**
- One idea per slide
- Minimize text — speak the details, don't project them
- Animate key figures to build understanding step-by-step
- Include a "takeaway" slide at the end (single sentence contribution)
- Prepare backup slides for anticipated questions

### Step 8.3: Blog Post / Social Media

An accessible summary significantly increases impact:

- **Twitter/X thread**: 5-8 tweets. Lead with the result, not the method. Include Figure 1 and key result figure.
- **Blog post**: 800-1500 words. Written for ML practitioners, not reviewers. Skip formalism, emphasize intuition and practical implications.
- **Project page**: HTML page with abstract, figures, demo, code link, BibTeX. Use GitHub Pages.

**Timing**: Post within 1-2 days of paper appearing on proceedings or arXiv camera-ready.

---

## Workshop & Short Papers

Workshop papers and short papers (e.g., ACL short papers, Findings papers) follow the same pipeline but with different constraints and expectations.

### Workshop Papers

| Property | Workshop | Main Conference |
|----------|----------|-----------------|
| **Page limit** | 4-6 pages (typically) | 7-9 pages |
| **Review standard** | Lower bar for completeness | Must be complete, thorough |
| **Review process** | Usually single-blind or light review | Double-blind, rigorous |
| **What's valued** | Interesting ideas, preliminary results, position pieces | Complete empirical story with strong baselines |
| **arXiv** | Post anytime | Timing matters (see arXiv strategy) |
| **Contribution bar** | Novel direction, interesting negative result, work-in-progress | Significant advance with strong evidence |

**When to target a workshop:**
- Early-stage idea you want feedback on before a full paper
- Negative result that doesn't justify 8+ pages
- Position piece or opinion on a timely topic
- Replication study or reproducibility report

### ACL Short Papers & Findings

ACL venues have distinct submission types:

| Type | Pages | What's Expected |
|------|-------|-----------------|
| **Long paper** | 8 | Complete study, strong baselines, ablations |
| **Short paper** | 4 | Focused contribution: one clear point with evidence |
| **Findings** | 8 | Solid work that narrowly missed main conference |

**Short paper strategy**: Pick ONE claim and support it thoroughly. Don't try to compress a long paper into 4 pages — write a different, more focused paper.

---

## Paper Types Beyond Empirical ML

The main pipeline above targets empirical ML papers. Other paper types require different structures and evidence standards. See [references/paper-types.md](references/paper-types.md) for detailed guidance on each type.

### Theory Papers

**Structure**: Introduction → Preliminaries (definitions, notation) → Main Results (theorems) → Proof Sketches → Discussion → Full Proofs (appendix)

**Key differences from empirical papers:**
- Contribution is a theorem, bound, or impossibility result — not experimental numbers
- Methods section replaced by "Preliminaries" and "Main Results"
- Proofs are the evidence, not experiments (though empirical validation of theory is welcome)
- Proof sketches in main text, full proofs in appendix is standard practice
- Experimental section is optional but strengthens the paper if it validates theoretical predictions

**Proof writing principles:**
- State theorems formally with all assumptions explicit
- Provide intuition before formal proof ("The key insight is...")
- Proof sketches should convey the main idea in 0.5-1 page
- Use `\begin{proof}...\end{proof}` environments
- Number assumptions and reference them in theorems: "Under Assumptions 1-3, ..."
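
A minimal `amsthm`-style skeleton following these principles (environment names and the stated rate are illustrative placeholders; match whatever your venue's style file defines, and `\mathbb` assumes `amssymb`):

```latex
% Preamble
\newtheorem{theorem}{Theorem}
\newtheorem{assumption}{Assumption}

% Body
\begin{assumption}\label{as:smooth}
The loss $\ell$ is $L$-smooth.
\end{assumption}

\begin{theorem}\label{thm:rate}
Under Assumption~\ref{as:smooth}, after $T$ iterations the iterates satisfy
$\mathbb{E}[\ell(x_T)] - \ell^\star \le \mathcal{O}(1/T)$.
\end{theorem}

\begin{proof}[Proof sketch]
The key insight is that ... The full proof is deferred to Appendix~A.
\end{proof}
```

Numbered, labeled assumptions make the dependency structure of each theorem auditable by reviewers.
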

### Survey / Tutorial Papers

**Structure**: Introduction → Taxonomy / Organization → Detailed Coverage → Open Problems → Conclusion

**Key differences:**
- Contribution is the organization, synthesis, and identification of open problems — not new methods
- Must be comprehensive within scope (reviewers will check for missing references)
- Requires a clear taxonomy or organizational framework
- Value comes from connections between works that individual papers don't make
- Best venues: TMLR (survey track), JMLR, Foundations and Trends in ML, ACM Computing Surveys

### Benchmark Papers

**Structure**: Introduction → Task Definition → Dataset Construction → Baseline Evaluation → Analysis → Intended Use & Limitations

**Key differences:**
- Contribution is the benchmark itself — it must fill a genuine evaluation gap
- Dataset documentation is mandatory, not optional (see Datasheets, Step 5.11)
- Must demonstrate the benchmark is challenging (baselines don't saturate it)
- Must demonstrate the benchmark measures what you claim it measures (construct validity)
- Best venues: NeurIPS Datasets & Benchmarks track, ACL (resource papers), LREC-COLING

### Position Papers

**Structure**: Introduction → Background → Thesis / Argument → Supporting Evidence → Counterarguments → Implications

**Key differences:**
- Contribution is an argument, not a result
- Must engage seriously with counterarguments
- Evidence can be empirical, theoretical, or logical analysis
- Best venues: ICML (position track), workshops, TMLR

---

## Hermes Agent Integration
| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals. |
| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. |
| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews. |
| Missing broader impact statement | See Step 5.10. Most venues require it. "No negative impacts" is almost never credible. |
| Human eval criticized as weak | See Step 2.5 and [references/human-evaluation.md](references/human-evaluation.md). Report agreement metrics, annotator details, compensation. |
| Reviewers question reproducibility | Release code (Step 7.10), document all hyperparameters, include seeds and compute details. |
| Theory paper lacks intuition | Add proof sketches with plain-language explanations before formal proofs. See [references/paper-types.md](references/paper-types.md). |
| Results are negative/null | See Phase 4.3 on handling negative results. Consider workshops, TMLR, or reframing as analysis. |

---
| [references/sources.md](references/sources.md) | Complete bibliography of all writing guides, conference guidelines, APIs |
| [references/experiment-patterns.md](references/experiment-patterns.md) | Experiment design patterns, evaluation protocols, monitoring, error recovery |
| [references/autoreason-methodology.md](references/autoreason-methodology.md) | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring |
| [references/human-evaluation.md](references/human-evaluation.md) | Human evaluation design, annotation guidelines, agreement metrics, crowdsourcing QC, IRB guidance |
| [references/paper-types.md](references/paper-types.md) | Theory papers (proof writing, theorem structure), survey papers, benchmark papers, position papers |

### LaTeX Templates

# Human Evaluation Guide for ML/AI Research

Comprehensive guide for designing, running, and reporting human evaluations in ML/AI papers. Human evaluation is the primary evidence for many NLP, HCI, and alignment papers, and is increasingly expected as complementary evidence at all ML venues.

---

## Contents

- [When Human Evaluation Is Needed](#when-human-evaluation-is-needed)
- [Study Design](#study-design)
- [Annotation Guidelines](#annotation-guidelines)
- [Platforms and Recruitment](#platforms-and-recruitment)
- [Quality Control](#quality-control)
- [Agreement Metrics](#agreement-metrics)
- [Statistical Analysis for Human Eval](#statistical-analysis-for-human-eval)
- [Reporting Requirements](#reporting-requirements)
- [IRB and Ethics](#irb-and-ethics)
- [Common Pitfalls](#common-pitfalls)

---

## When Human Evaluation Is Needed

| Scenario | Human Eval Required? | Notes |
|----------|---------------------|-------|
| Text generation quality (fluency, coherence) | **Yes** | Automated metrics (BLEU, ROUGE) correlate poorly with human judgment |
| Factual accuracy of generated text | **Strongly recommended** | Automated fact-checking is unreliable |
| Safety/toxicity evaluation | **Yes for nuanced cases** | Classifiers miss context-dependent harm |
| Preference between two systems | **Yes** | Most reliable method for comparing LLM outputs |
| Summarization quality | **Yes** | ROUGE doesn't capture faithfulness or relevance well |
| Task completion (UI, agents) | **Yes** | User studies are the gold standard |
| Classification accuracy | **Usually no** | Ground truth labels suffice; human eval adds cost without insight |
| Perplexity or loss comparisons | **No** | Automated metrics are the correct evaluation |

---

## Study Design

### Evaluation Types

| Type | When to Use | Pros | Cons |
|------|-------------|------|------|
| **Pairwise comparison** | Comparing two systems | Most reliable, minimizes scale bias | Only compares pairs, quadratic in systems |
| **Likert scale** (1-5 or 1-7) | Rating individual outputs | Easy to aggregate | Subjective anchoring, scale compression |
| **Ranking** | Ordering 3+ systems | Captures full preference order | Cognitive load increases with items |
| **Best-worst scaling** | Comparing many systems efficiently | More reliable than Likert, linear in items | Requires careful item selection |
| **Binary judgment** | Yes/no decisions (grammatical? factual?) | Simple, high agreement | Loses nuance |
| **Error annotation** | Identifying specific error types | Rich diagnostic information | Expensive, requires trained annotators |

**Recommendation for most ML papers**: Pairwise comparison is the most defensible. Reviewers rarely question its validity. For Likert scales, always report both mean and distribution.
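
Once pairwise judgments are in, a two-sided sign test gives the significance of a win rate. A minimal sketch using `scipy`, assuming counts are passed with ties already dropped:

```python
from scipy.stats import binomtest

def pairwise_win_test(wins_a, wins_b):
    """Two-sided sign test on pairwise preferences (ties already dropped)."""
    n = wins_a + wins_b
    result = binomtest(wins_a, n, p=0.5, alternative="two-sided")
    return wins_a / n, result.pvalue

win_rate, p = pairwise_win_test(wins_a=127, wins_b=73)
print(f"System A win rate: {win_rate:.2f}, p = {p:.4f}")
```

Report the win rate, the number of comparisons, the tie-handling rule, and the p-value together; the sample-size table below tells you how many pairs you need for this test to have power.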

### Sample Size Planning

**Minimum viable sample sizes:**

| Study Type | Minimum Items | Minimum Annotators | Notes |
|------------|--------------|-------------------|-------|
| Pairwise comparison | 100 pairs | 3 per pair | Detects ~10% win rate difference at p<0.05 |
| Likert rating | 100 items | 3 per item | Enough for meaningful averages |
| Ranking | 50 sets | 3 per set | Each set contains all systems being compared |
| Error annotation | 200 items | 2 per item | Higher agreement expected for structured schemes |

**Power analysis** (for planning more precisely):

```python
from scipy import stats
import numpy as np

def sample_size_pairwise(effect_size=0.10, alpha=0.05, power=0.80):
    """
    Estimate sample size for pairwise comparison (sign test).
    effect_size: expected win rate difference from 0.50
    """
    p_expected = 0.50 + effect_size
    # Normal approximation to binomial
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = ((z_alpha * np.sqrt(0.25) + z_beta * np.sqrt(p_expected * (1 - p_expected))) ** 2) / (effect_size ** 2)
    return int(np.ceil(n))

print(f"Sample size for 10% effect: {sample_size_pairwise(0.10)}")  # ~200
print(f"Sample size for 15% effect: {sample_size_pairwise(0.15)}")  # ~90
print(f"Sample size for 20% effect: {sample_size_pairwise(0.20)}")  # ~50
```

### Controlling for Bias

| Bias | Mitigation |
|------|-----------|
| **Order bias** (first item preferred) | Randomize presentation order for each annotator |
| **Length bias** (longer = better) | Control for length or analyze separately |
| **Anchoring** (first annotation sets scale) | Include warm-up items (not counted) |
| **Fatigue** (quality drops over time) | Limit session length (30-45 min max), randomize item order |
| **Annotator expertise** | Report annotator background; use qualification tasks |
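
Order randomization (the first mitigation above) is cheap to implement. A minimal sketch that gives each annotator a reproducible shuffled item order and a random A/B side per item, seeded so assignments can be regenerated for auditing:

```python
import random

def randomized_assignment(items, annotator_id, seed=0):
    """Per-annotator item order and A/B side assignment, reproducible via seeding."""
    rng = random.Random(f"{seed}-{annotator_id}")
    order = list(items)
    rng.shuffle(order)  # a different order per annotator counters order and fatigue effects
    # For each item, True means system A is displayed in the first/left slot
    return [(item, rng.random() < 0.5) for item in order]
```

Log the seed alongside the collected judgments so the de-randomization step at analysis time is deterministic.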

---

## Annotation Guidelines

Well-written annotation guidelines are the single biggest factor in evaluation quality. Invest significant time here.

### Structure of Good Guidelines

```markdown
# [Task Name] Annotation Guidelines

## Overview
[1-2 sentences describing the task]

## Definitions
[Define every term annotators will use in their judgments]
- Quality: [specific definition for this study]
- Fluency: [specific definition]
- Factuality: [specific definition]

## Rating Scale
[For each scale point, provide:]
- Numeric value
- Label (e.g., "Excellent", "Good", "Acceptable", "Poor", "Unacceptable")
- Definition of what qualifies for this rating
- 1-2 concrete examples at this level

## Examples

### Example 1: [Rating = 5]
Input: [exact input]
Output: [exact output]
Rating: 5
Explanation: [why this is a 5]

### Example 2: [Rating = 2]
Input: [exact input]
Output: [exact output]
Rating: 2
Explanation: [why this is a 2]

[Include at least 2 examples per rating level, covering edge cases]

## Edge Cases
- If the output is [ambiguous case]: [instruction]
- If the input is [unusual case]: [instruction]

## Common Mistakes
- Don't [common annotator error]
- Don't let [bias] influence your rating
```

### Pilot Testing

**Always run a pilot** before the full study:
1. 3-5 annotators, 20-30 items
2. Compute agreement metrics
3. Discuss disagreements in group session
4. Revise guidelines based on confusion points
5. Run second pilot if agreement was poor (<0.40 kappa)

---

## Platforms and Recruitment

| Platform | Best For | Cost | Quality |
|----------|----------|------|---------|
| **Prolific** | General annotation, surveys | $8-15/hr | High (academic-focused pool) |
| **Amazon MTurk** | Large-scale simple tasks | $5-12/hr | Variable (needs strong QC) |
| **Surge AI** | NLP-specific annotation | $15-25/hr | Very high (trained annotators) |
| **Scale AI** | Production-quality labeling | Varies | High (managed workforce) |
| **Internal team** | Domain expertise required | Varies | Highest for specialized tasks |
| **Upwork/contractors** | Long-term annotation projects | $10-30/hr | Depends on hiring |

**Fair compensation**: Always pay at least the equivalent of local minimum wage for the annotator's location. Many conferences (ACL in particular) now ask about annotator compensation. Paying below minimum wage is an ethics risk.

**Prolific setup (recommended for most ML papers):**
1. Create study on prolific.co
2. Set prescreening filters (language, country, approval rate >95%)
3. Estimate time per task from pilot → set fair payment
4. Use Prolific's built-in attention checks or add your own
5. Collect Prolific IDs for quality tracking (but don't share in paper)

---

## Quality Control

### Attention Checks

Include items where the correct answer is unambiguous:

```python
# Types of attention checks
attention_checks = {
    "instructed_response": "For this item, please select 'Strongly Agree' regardless of content.",
    "obvious_quality": "Rate this clearly ungrammatical text: 'The cat dog house green yesterday.'",  # Should get lowest score
    "gold_standard": "Items where expert consensus exists (pre-annotated by authors)",
    "trap_question": "What color is the sky on a clear day? (embedded in annotation interface)"
}

# Recommended: 10-15% of total items should be checks
# Exclusion criterion: fail 2+ attention checks → exclude annotator
```

### Annotator Qualification

For tasks requiring expertise:

```
Qualification Task Design:
1. Create a set of 20-30 items with known-correct labels
2. Require annotators to complete this before the main task
3. Set threshold: ≥80% agreement with gold labels to qualify
4. Record qualification scores for reporting
```
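
Scoring the qualification task against gold labels is then mechanical. A minimal sketch assuming parallel lists of responses and gold labels:

```python
def qualifies(responses, gold, threshold=0.80):
    """Pass/fail an annotator's qualification task against gold labels."""
    agreement = sum(r == g for r, g in zip(responses, gold)) / len(gold)
    return agreement >= threshold, agreement

# 3/4 = 0.75 agreement, below the 0.80 threshold
passed, score = qualifies(["A", "B", "A", "A"], ["A", "B", "B", "A"])
```

Keep the raw agreement score, not just the pass/fail bit, so qualification statistics can be reported in the paper.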
|
||||
|
||||
### Monitoring During Collection
|
||||
|
||||
```python
|
||||
# Real-time quality monitoring
|
||||
def monitor_quality(annotations):
|
||||
"""Check for annotation quality issues during collection."""
|
||||
issues = []
|
||||
|
||||
# 1. Check for straight-lining (same answer for everything)
|
||||
for annotator_id, items in annotations.groupby('annotator'):
|
||||
if items['rating'].nunique() <= 1:
|
||||
issues.append(f"Annotator {annotator_id}: straight-lining detected")
|
||||
|
||||
# 2. Check time per item (too fast = not reading)
|
||||
median_time = annotations['time_seconds'].median()
|
||||
fast_annotators = annotations.groupby('annotator')['time_seconds'].median()
|
||||
for ann_id, time in fast_annotators.items():
|
||||
if time < median_time * 0.3:
|
||||
issues.append(f"Annotator {ann_id}: suspiciously fast ({time:.0f}s vs median {median_time:.0f}s)")
|
||||
|
||||
# 3. Check attention check performance
|
||||
checks = annotations[annotations['is_attention_check']]
|
||||
for ann_id, items in checks.groupby('annotator'):
|
||||
accuracy = (items['rating'] == items['gold_rating']).mean()
|
||||
if accuracy < 0.80:
|
||||
issues.append(f"Annotator {ann_id}: failing attention checks ({accuracy:.0%})")
|
||||
|
||||
return issues
|
||||
```
|
||||
|
||||
---

## Agreement Metrics

### Which Metric to Use

| Metric | When to Use | Interpretation |
|--------|-------------|----------------|
| **Cohen's kappa (κ)** | Exactly 2 annotators, categorical | Chance-corrected agreement |
| **Fleiss' kappa** | 3+ annotators, all rate same items, categorical | Multi-annotator extension of Cohen's |
| **Krippendorff's alpha (α)** | Any number of annotators, handles missing data | Most general; recommended default |
| **ICC (Intraclass Correlation)** | Continuous ratings (Likert) | Consistency among raters |
| **Percent agreement** | Reporting alongside kappa/alpha | Raw agreement (not chance-corrected) |
| **Kendall's W** | Rankings | Concordance among rankers |

**Always report at least two**: one chance-corrected metric (kappa or alpha) AND raw percent agreement.
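Why a chance-corrected metric on top of raw agreement? With skewed label distributions, two annotators can agree often purely by chance. A minimal self-contained sketch (the counts are made up for illustration):

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators rating the same items."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    p_o = np.mean(labels_a == labels_b)  # observed raw agreement
    # expected agreement under independent marginal label distributions
    cats = np.union1d(labels_a, labels_b)
    p_e = sum(np.mean(labels_a == c) * np.mean(labels_b == c) for c in cats)
    return (p_o - p_e) / (1 - p_e)

# 100 items, heavily skewed toward label 1
a = [1] * 80 + [1] * 10 + [0] * 10   # annotator A
b = [1] * 80 + [0] * 10 + [1] * 10   # annotator B: agrees only on the first 80
# raw agreement = 0.80, but kappa ≈ -0.11: the agreement is no better than chance
```

This is exactly why raw agreement alone can mislead: 80% agreement here falls in the "poor" band once chance-corrected.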
### Interpretation Guide

| Value | Krippendorff's α / Cohen's κ | Quality |
|-------|------------------------------|---------|
| > 0.80 | Excellent agreement | Reliable for most purposes |
| 0.67 - 0.80 | Good agreement | Acceptable for most ML papers |
| 0.40 - 0.67 | Moderate agreement | Borderline; discuss in paper |
| < 0.40 | Poor agreement | Revise guidelines and redo annotation |

**Note**: Krippendorff recommends α > 0.667 as the minimum for tentative conclusions. NLP tasks with subjective judgments (fluency, helpfulness) typically achieve 0.40-0.70.
### Implementation

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff  # pip install krippendorff

def compute_agreement(annotations_matrix):
    """
    annotations_matrix: shape (n_items, n_annotators)
    Values: ratings (int or float). Use np.nan for missing.
    """
    results = {}

    # Krippendorff's alpha (handles missing data, any number of annotators)
    results['krippendorff_alpha'] = krippendorff.alpha(
        annotations_matrix.T,  # krippendorff expects (annotators, items)
        level_of_measurement='ordinal'  # or 'nominal', 'interval', 'ratio'
    )

    # Pairwise Cohen's kappa (for 2 annotators at a time)
    n_annotators = annotations_matrix.shape[1]
    kappas = []
    for i in range(n_annotators):
        for j in range(i + 1, n_annotators):
            mask = ~np.isnan(annotations_matrix[:, i]) & ~np.isnan(annotations_matrix[:, j])
            if mask.sum() > 0:
                k = cohen_kappa_score(
                    annotations_matrix[mask, i].astype(int),
                    annotations_matrix[mask, j].astype(int)
                )
                kappas.append(k)
    results['mean_pairwise_kappa'] = np.mean(kappas) if kappas else None

    # Raw percent agreement
    agree_count = 0
    total_count = 0
    for item in range(annotations_matrix.shape[0]):
        ratings = annotations_matrix[item, ~np.isnan(annotations_matrix[item, :])]
        if len(ratings) >= 2:
            # All annotators agree
            if len(set(ratings.astype(int))) == 1:
                agree_count += 1
            total_count += 1
    results['percent_agreement'] = agree_count / total_count if total_count > 0 else None

    return results
```
---

## Statistical Analysis for Human Eval

### Pairwise Comparisons

```python
import numpy as np
from scipy import stats

def analyze_pairwise(wins_a, wins_b, ties=0):
    """
    Analyze pairwise comparison results.
    wins_a: number of times system A won
    wins_b: number of times system B won
    ties: number of ties (excluded from sign test)
    """
    n = wins_a + wins_b  # exclude ties

    # Sign test (exact binomial).
    # Note: stats.binom_test was removed in SciPy 1.12; use binomtest.
    p_value = stats.binomtest(wins_a, n, 0.5, alternative='two-sided').pvalue

    # Win rate with 95% CI (Wilson score interval)
    win_rate = wins_a / n if n > 0 else 0.5
    z = 1.96
    denominator = 1 + z**2 / n
    center = (win_rate + z**2 / (2 * n)) / denominator
    margin = z * np.sqrt((win_rate * (1 - win_rate) + z**2 / (4 * n)) / n) / denominator
    ci_lower = center - margin
    ci_upper = center + margin

    return {
        'win_rate_a': win_rate,
        'win_rate_b': 1 - win_rate,
        'p_value': p_value,
        'ci_95': (ci_lower, ci_upper),
        'significant': p_value < 0.05,
        'n_comparisons': n,
        'ties': ties,
    }
```
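The sign test does not strictly require SciPy (whose `binom_test` was removed in recent versions in favor of `binomtest`). For a dependency-free check, the exact two-sided sign test is a few lines of standard-library Python; `wins_a`/`wins_b` follow the convention above, with ties excluded beforehand:

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Exact two-sided sign test p-value under Binomial(n, 0.5)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # one-sided tail P(X >= k), doubled for two-sided, capped at 1
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For the symmetric null p = 0.5, doubling the tail matches the usual two-sided convention.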
### Likert Scale Analysis

```python
def analyze_likert(ratings_a, ratings_b):
    """Compare Likert ratings between two systems (paired)."""
    # Wilcoxon signed-rank test (non-parametric, paired)
    stat, p_value = stats.wilcoxon(ratings_a, ratings_b, alternative='two-sided')

    # Effect size: matched-pairs rank-biserial correlation.
    # stat is the smaller signed-rank sum; the total rank sum is n(n+1)/2,
    # so r = 1 - 2*stat / (n(n+1)/2) = 1 - 4*stat / (n(n+1)).
    n = len(ratings_a)
    r = 1 - (4 * stat) / (n * (n + 1))

    return {
        'mean_a': np.mean(ratings_a),
        'mean_b': np.mean(ratings_b),
        'std_a': np.std(ratings_a),
        'std_b': np.std(ratings_b),
        'wilcoxon_stat': stat,
        'p_value': p_value,
        'effect_size_r': r,
        'significant': p_value < 0.05,
    }
```
### Multiple Comparisons Correction

When comparing more than two systems:

```python
from statsmodels.stats.multitest import multipletests

# After computing p-values for all pairs
p_values = [0.03, 0.001, 0.08, 0.04, 0.15, 0.002]
rejected, corrected_p, _, _ = multipletests(p_values, method='holm')
# Use corrected p-values in your paper
```
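If `statsmodels` is unavailable, Holm's step-down adjustment can be implemented directly. A sketch that returns adjusted p-values in the original input order:

```python
def holm_correction(p_values):
    """Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

For example, `holm_correction([0.01, 0.04, 0.03])` yields approximately `[0.03, 0.06, 0.06]`.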
---

## Reporting Requirements

Reviewers at NLP venues (ACL, EMNLP, NAACL) check for all of these. ML venues (NeurIPS, ICML) increasingly expect them too.

### Mandatory Reporting

```latex
% In your paper's human evaluation section:
\paragraph{Annotators.} We recruited [N] annotators via [platform].
[Describe qualifications or screening.] Annotators were paid
\$[X]/hour, above the [country] minimum wage.

\paragraph{Agreement.} Inter-annotator agreement was [metric] = [value]
(Krippendorff's $\alpha$ = [value]; raw agreement = [value]\%).
[If low: explain why the task is subjective and how you handle disagreements.]

\paragraph{Evaluation Protocol.} Each [item type] was rated by [N]
annotators on a [scale description]. We collected [total] annotations
across [N items]. [Describe randomization and blinding.]
```
### What Goes in the Appendix

```
Appendix: Human Evaluation Details
- Full annotation guidelines (verbatim)
- Screenshot of annotation interface
- Qualification task details and threshold
- Attention check items and failure rates
- Per-annotator agreement breakdown
- Full results table (not just averages)
- Compensation calculation
- IRB approval number (if applicable)
```

---
## IRB and Ethics

### When IRB Approval Is Needed

| Situation | IRB Required? |
|-----------|---------------|
| Crowdworkers rating text quality | **Usually no** (not "human subjects research" at most institutions) |
| User study with real users | **Yes** at most US/EU institutions |
| Collecting personal information | **Yes** |
| Studying annotator behavior/cognition | **Yes** (they become the subject) |
| Using existing annotated data | **Usually no** (secondary data analysis) |

**Check your institution's policy.** The definition of "human subjects research" varies. When in doubt, submit an IRB protocol — the review is often fast for minimal-risk studies.

### Ethics Checklist for Human Evaluation

```
- [ ] Annotators informed about task purpose (not deceptive)
- [ ] Annotators can withdraw at any time without penalty
- [ ] No personally identifiable information collected beyond platform ID
- [ ] Content being evaluated does not expose annotators to harm
      (if it does: content warnings + opt-out + higher compensation)
- [ ] Fair compensation (>= equivalent local minimum wage)
- [ ] Data stored securely, access limited to research team
- [ ] IRB approval obtained if required by institution
```
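The fair-compensation items above are easy to operationalize: estimate the effective hourly rate from piloted per-item timing before launching the task. A sketch (the numbers in the example are placeholders, not wage recommendations):

```python
def effective_hourly_rate(pay_per_item, median_seconds_per_item):
    """Effective hourly rate implied by per-item pay and piloted timing."""
    items_per_hour = 3600 / median_seconds_per_item
    return pay_per_item * items_per_hour

# e.g. $0.15/item at a piloted median of 40 s/item → ≈ $13.50/hour
rate = effective_hourly_rate(0.15, 40)
# Compare against the applicable minimum wage before launching,
# and report the calculation in the appendix.
```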
---

## Common Pitfalls

| Pitfall | Problem | Fix |
|---------|---------|-----|
| Too few annotators (1-2) | No agreement metric possible | Minimum 3 annotators per item |
| No attention checks | Can't detect low-quality annotations | Include 10-15% attention checks |
| Not reporting compensation | Reviewers flag as ethics concern | Always report hourly rate |
| Using only automated metrics for generation | Reviewers will ask for human eval | Add at least pairwise comparison |
| Not piloting guidelines | Low agreement, wasted budget | Always pilot with 3-5 people first |
| Reporting only averages | Hides annotator disagreement | Report distribution and agreement |
| Not controlling for order/position | Position bias inflates results | Randomize presentation order |
| Conflating annotator agreement with ground truth | High agreement doesn't mean correct | Validate against expert judgments |
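The order/position pitfall in the table above is cheap to avoid: randomize which system appears on each side per item and store the mapping so preferences can be decoded afterwards. A minimal sketch using Python's standard `random` module (function names are illustrative):

```python
import random

def assign_sides(item_ids, seed=0):
    """Randomly assign systems A/B to left/right per item; return the mapping."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    mapping = {}
    for item_id in item_ids:
        left = rng.choice(["A", "B"])
        mapping[item_id] = {"left": left, "right": "B" if left == "A" else "A"}
    return mapping

def decode_choice(mapping, item_id, side_chosen):
    """Translate a 'left'/'right' preference back to system A or B."""
    return mapping[item_id][side_chosen]
```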
---

**New file: `skills/research/research-paper-writing/references/paper-types.md`** (481 lines)
# Paper Types Beyond Empirical ML

Guide for writing non-standard paper types: theory papers, survey/tutorial papers, benchmark/dataset papers, position papers, and reproducibility/replication papers. Each type has distinct structure, evidence standards, and venue expectations.

---

## Contents

- [Theory Papers](#theory-papers)
- [Survey and Tutorial Papers](#survey-and-tutorial-papers)
- [Benchmark and Dataset Papers](#benchmark-and-dataset-papers)
- [Position Papers](#position-papers)
- [Reproducibility and Replication Papers](#reproducibility-and-replication-papers)

---
## Theory Papers

### When to Write a Theory Paper

Your paper should be a theory paper if:
- The main contribution is a theorem, bound, impossibility result, or formal characterization
- Experiments are supplementary validation, not the core evidence
- The contribution advances understanding rather than achieving state-of-the-art numbers

### Structure

```
1. Introduction (1-1.5 pages)
   - Problem statement and motivation
   - Informal statement of main results
   - Comparison to prior theoretical work
   - Contribution bullets (state theorems informally)

2. Preliminaries (0.5-1 page)
   - Notation table
   - Formal definitions
   - Assumptions (numbered, referenced later)
   - Known results you build on

3. Main Results (2-3 pages)
   - Theorem statements (formal)
   - Proof sketches (intuition + key steps)
   - Corollaries and special cases
   - Discussion of tightness / optimality

4. Experimental Validation (1-2 pages, optional but recommended)
   - Do theoretical predictions match empirical behavior?
   - Synthetic experiments that isolate the phenomenon
   - Comparison to bounds from prior work

5. Related Work (1 page)
   - Theoretical predecessors
   - Empirical work your theory explains

6. Discussion & Open Problems (0.5 page)
   - Limitations of your results
   - Conjectures suggested by your analysis
   - Concrete open problems

Appendix:
- Full proofs
- Technical lemmas
- Extended experimental details
```
### Writing Theorems

**Template for a well-stated theorem:**

```latex
\begin{assumption}[Bounded Gradients]\label{assum:bounded-grad}
There exists $G > 0$ such that $\|\nabla f(x)\| \leq G$ for all $x \in \mathcal{X}$.
\end{assumption}

\begin{theorem}[Convergence Rate]\label{thm:convergence}
Under Assumptions~\ref{assum:bounded-grad} and~\ref{assum:smoothness},
Algorithm~\ref{alg:method} with step size $\eta = \frac{1}{\sqrt{T}}$ satisfies
\[
\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\left[\|\nabla f(x_t)\|^2\right]
\leq \frac{2(f(x_1) - f^*)}{\sqrt{T}} + \frac{G^2}{\sqrt{T}}.
\]
In particular, after $T = O(1/\epsilon^2)$ iterations, we obtain an
$\epsilon$-stationary point.
\end{theorem}
```

**Rules for theorem statements:**
- State all assumptions explicitly (numbered, with names)
- Include the formal bound, not just "converges at rate O(·)"
- Add a plain-language corollary: "In particular, this means..."
- Compare to known bounds: "This improves over [prior work]'s bound of O(·) by a factor of..."
### Proof Sketches

The proof sketch is the most important part of the main text for a theory paper. Reviewers evaluate whether you have genuine insight or just mechanical derivation.

**Good proof sketch pattern:**

```latex
\begin{proof}[Proof Sketch of Theorem~\ref{thm:convergence}]
The key insight is that [one sentence describing the main idea].

The proof proceeds in three steps:
\begin{enumerate}
  \item \textbf{Decomposition.} We decompose the error into [term A]
  and [term B] using [technique]. This reduces the problem to
  bounding each term separately.

  \item \textbf{Bounding [term A].} By [assumption/lemma], [term A]
  is bounded by $O(\cdot)$. The critical observation is that
  [specific insight that makes this non-trivial].

  \item \textbf{Combining.} Choosing $\eta = 1/\sqrt{T}$ balances
  the two terms, yielding the stated bound.
\end{enumerate}

The full proof, including the technical lemma for Step 2,
appears in Appendix~\ref{app:proofs}.
\end{proof}
```

**Bad proof sketch**: Restating the theorem with slightly different notation, or just saying "the proof follows standard techniques."
### Full Proofs in Appendix

```latex
\appendix
\section{Proofs}\label{app:proofs}

\subsection{Proof of Theorem~\ref{thm:convergence}}

We first establish two technical lemmas.

\begin{lemma}[Descent Lemma]\label{lem:descent}
Under Assumption~\ref{assum:smoothness}, for any step size $\eta \leq 1/L$:
\[
f(x_{t+1}) \leq f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 + \frac{\eta^2 L}{2}\|\nabla f(x_t)\|^2.
\]
\end{lemma}

\begin{proof}
[Complete proof with all steps]
\end{proof}

% Continue with remaining lemmas and main theorem proof
```
### Common Theory Paper Pitfalls

| Pitfall | Problem | Fix |
|---------|---------|-----|
| Assumptions too strong | Trivializes the result | Discuss which assumptions are necessary; prove lower bounds |
| No comparison to existing bounds | Reviewers can't assess contribution | Add a comparison table of bounds |
| Proof sketch is just the full proof shortened | Doesn't convey insight | Focus on the 1-2 key ideas; defer mechanics to appendix |
| No experimental validation | Reviewers question practical relevance | Add synthetic experiments testing predictions |
| Notation inconsistency | Confuses reviewers | Create a notation table in Preliminaries |
| Overly complex proofs where simple ones exist | Reviewers suspect error | Prefer clarity over generality |
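The "comparison table of bounds" fix can be a small `booktabs` tabular in the main text. A hedged LaTeX sketch; the rates, the `\cite{priorwork}` key, and the assumption entries are placeholders to replace with your actual comparison:

```latex
\begin{table}[t]
\centering
\caption{Convergence rates under comparable assumptions (placeholder entries).}
\begin{tabular}{lll}
\toprule
Method & Rate & Key assumptions \\
\midrule
Prior bound~\cite{priorwork} & $O(1/\sqrt[4]{T})$ & Bounded gradients \\
\textbf{Ours (Thm.~\ref{thm:convergence})} & $O(1/\sqrt{T})$ & Bounded gradients + smoothness \\
\bottomrule
\end{tabular}
\end{table}
```

A table like this lets reviewers assess the contribution at a glance, and makes explicit which extra assumptions buy the improvement.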
### Venues for Theory Papers

| Venue | Theory Acceptance Rate | Notes |
|-------|------------------------|-------|
| **NeurIPS** | Moderate | Values theory with practical implications |
| **ICML** | High | Strong theory track |
| **ICLR** | Moderate | Prefers theory with empirical validation |
| **COLT** | High | Theory-focused venue |
| **ALT** | High | Algorithmic learning theory |
| **STOC/FOCS** | For TCS-flavored results | If contribution is primarily combinatorial/algorithmic |
| **JMLR** | High | No page limit; good for long proofs |

---
## Survey and Tutorial Papers

### When to Write a Survey

- A subfield has matured enough that synthesis is valuable
- You've identified connections between works that individual papers don't make
- Newcomers to the area have no good entry point
- The landscape has changed significantly since the last survey

**Warning**: Surveys require genuine expertise. A survey by someone outside the field, however comprehensive, will miss nuances and mischaracterize work.

### Structure

```
1. Introduction (1-2 pages)
   - Scope definition (what's included and excluded, and why)
   - Motivation for the survey now
   - Overview of organization (often with a figure)

2. Background / Problem Formulation (1-2 pages)
   - Formal problem definition
   - Notation (used consistently throughout)
   - Historical context

3. Taxonomy (the core contribution)
   - Organize methods along meaningful axes
   - Present taxonomy as a figure or table
   - Each category gets a subsection

4. Detailed Coverage (bulk of paper)
   - For each category: representative methods, key ideas, strengths/weaknesses
   - Comparison tables within and across categories
   - Don't just describe — analyze and compare

5. Experimental Comparison (if applicable)
   - Standardized benchmark comparison
   - Fair hyperparameter tuning for all methods
   - Not always feasible but significantly strengthens the survey

6. Open Problems & Future Directions (1-2 pages)
   - Unsolved problems the field should tackle
   - Promising but underexplored directions
   - This section is what makes a survey a genuine contribution

7. Conclusion
```
### Taxonomy Design

The taxonomy is the core intellectual contribution of a survey. It should:

- **Be meaningful**: Categories should correspond to real methodological differences, not arbitrary groupings
- **Be exhaustive**: Every relevant paper should fit somewhere
- **Be mutually exclusive** (ideally): Each paper belongs to one primary category
- **Have informative names**: "Attention-based methods" > "Category 3"
- **Be visualized**: A figure showing the taxonomy is almost always helpful

**Example taxonomy axes for an "LLM Reasoning" survey:**
- By technique: chain-of-thought, tree-of-thought, self-consistency, tool use
- By training requirement: prompting-only, fine-tuned, RLHF
- By reasoning type: mathematical, commonsense, logical, causal

### Writing Standards

- **Cite every relevant paper** — authors will check if their work is included
- **Be fair** — don't dismiss methods you don't prefer
- **Synthesize, don't just list** — identify patterns, trade-offs, open questions
- **Include a comparison table** — even if qualitative (features/properties checklist)
- **Update before submission** — check arXiv for papers published since you started writing
### Venues for Surveys

| Venue | Notes |
|-------|-------|
| **TMLR** (Survey track) | Dedicated survey submissions; no page limit |
| **JMLR** | Long format, well-respected |
| **Foundations and Trends in ML** | Invited, but can be proposed |
| **ACM Computing Surveys** | Broad CS audience |
| **arXiv** (standalone) | No peer review but high visibility if well-done |
| **Conference tutorials** | Present as tutorial at NeurIPS/ICML/ACL; write up as paper |

---
## Benchmark and Dataset Papers

### When to Write a Benchmark Paper

- Existing benchmarks don't measure what you think matters
- A new capability has emerged with no standard evaluation
- Existing benchmarks are saturated (all methods score >95%)
- You want to standardize evaluation in a fragmented subfield

### Structure

```
1. Introduction
   - What evaluation gap does this benchmark fill?
   - Why existing benchmarks are insufficient

2. Task Definition
   - Formal task specification
   - Input/output format
   - Evaluation criteria (what makes a good answer?)

3. Dataset Construction
   - Data source and collection methodology
   - Annotation process (if human-annotated)
   - Quality control measures
   - Dataset statistics (size, distribution, splits)

4. Baseline Evaluation
   - Run strong baselines (don't just report random/majority)
   - Show the benchmark is challenging but not impossible
   - Human performance baseline (if feasible)

5. Analysis
   - Error analysis on baselines
   - What makes items hard/easy?
   - Construct validity: does the benchmark measure what you claim?

6. Intended Use & Limitations
   - What should this benchmark be used for?
   - What should it NOT be used for?
   - Known biases or limitations

7. Datasheet (Appendix)
   - Full datasheet for datasets (Gebru et al.)
```
### Evidence Standards

Reviewers evaluate benchmarks on different criteria than methods papers:

| Criterion | What Reviewers Check |
|-----------|----------------------|
| **Novelty of evaluation** | Does this measure something existing benchmarks don't? |
| **Construct validity** | Does the benchmark actually measure the stated capability? |
| **Difficulty calibration** | Not too easy (saturated) or too hard (random performance) |
| **Annotation quality** | Agreement metrics, annotator qualifications, guidelines |
| **Documentation** | Datasheet, license, maintenance plan |
| **Reproducibility** | Can others use this benchmark easily? |
| **Ethical considerations** | Bias analysis, consent, sensitive content handling |
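The difficulty-calibration criterion can be sanity-checked numerically before submission. A sketch; the 0.95 saturation and 0.10 headroom thresholds are illustrative defaults, not community standards:

```python
def calibration_flag(best_baseline_acc, chance_acc,
                     saturation=0.95, headroom=0.10):
    """Flag a benchmark as saturated, too hard, or well-calibrated.

    best_baseline_acc: accuracy of the strongest baseline you ran
    chance_acc: random/majority-class accuracy for the task
    Thresholds are illustrative; tune them to your task.
    """
    if best_baseline_acc >= saturation:
        return "saturated: best baseline is already near ceiling"
    if best_baseline_acc - chance_acc < headroom:
        return "too hard: baselines barely beat chance"
    return "ok: challenging but tractable"
```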
### Dataset Documentation (Required)

Follow the Datasheets for Datasets framework (Gebru et al., 2021):

```
Datasheet Questions:
1. Motivation
   - Why was this dataset created?
   - Who created it and on behalf of whom?
   - Who funded the creation?

2. Composition
   - What do the instances represent?
   - How many instances are there?
   - Does it contain all possible instances or a sample?
   - Is there a label? If so, how was it determined?
   - Are there recommended data splits?

3. Collection Process
   - How was the data collected?
   - Who was involved in collection?
   - Over what timeframe?
   - Was ethical review conducted?

4. Preprocessing
   - What preprocessing was done?
   - Was the "raw" data saved?

5. Uses
   - What tasks has this been used for?
   - What should it NOT be used for?
   - Are there other tasks it could be used for?

6. Distribution
   - How is it distributed?
   - Under what license?
   - Are there any restrictions?

7. Maintenance
   - Who maintains it?
   - How can users contact the maintainer?
   - Will it be updated? How?
   - Is there an erratum?
```
### Venues for Benchmark Papers

| Venue | Notes |
|-------|-------|
| **NeurIPS Datasets & Benchmarks** | Dedicated track; best venue for this |
| **ACL** (Resource papers) | NLP-focused datasets |
| **LREC-COLING** | Language resources |
| **TMLR** | Good for benchmarks with analysis |

---
## Position Papers

### When to Write a Position Paper

- You have an argument about how the field should develop
- You want to challenge a widely-held assumption
- You want to propose a research agenda based on analysis
- You've identified a systematic problem in current methodology

### Structure

```
1. Introduction
   - State your thesis clearly in the first paragraph
   - Why this matters now

2. Background
   - Current state of the field
   - Prevailing assumptions you're challenging

3. Argument
   - Present your thesis with supporting evidence
   - Evidence can be: empirical data, theoretical analysis, logical argument,
     case studies, historical precedent
   - Be rigorous — this isn't an opinion piece

4. Counterarguments
   - Engage seriously with the strongest objections
   - Explain why they don't undermine your thesis
   - Concede where appropriate — it strengthens credibility

5. Implications
   - What should the field do differently?
   - Concrete research directions your thesis suggests
   - How should evaluation/methodology change?

6. Conclusion
   - Restate thesis
   - Call to action
```
### Writing Standards

- **Lead with the strongest version of your argument** — don't hedge in the first paragraph
- **Engage with counterarguments honestly** — the best position papers address the strongest objections, not the weakest
- **Provide evidence** — a position paper without evidence is an editorial
- **Be concrete** — "the field should do X" is better than "more work is needed"
- **Don't straw-man existing work** — characterize opposing positions fairly
### Venues for Position Papers

| Venue | Notes |
|-------|-------|
| **ICML** (Position track) | Dedicated track for position papers |
| **NeurIPS** (Workshop papers) | Workshops often welcome position pieces |
| **ACL** (Theme papers) | When your position aligns with the conference theme |
| **TMLR** | Accepts well-argued position papers |
| **CACM** | For broader CS audience |

---
## Reproducibility and Replication Papers

### When to Write a Reproducibility Paper

- You attempted to reproduce a published result and succeeded/failed
- You want to verify claims under different conditions
- You've identified that a popular method's performance depends on unreported details

### Structure

```
1. Introduction
   - What paper/result are you reproducing?
   - Why is this reproduction valuable?

2. Original Claims
   - State the exact claims from the original paper
   - What evidence was provided?

3. Methodology
   - Your reproduction approach
   - Differences from original (if any) and why
   - What information was missing from the original paper?

4. Results
   - Side-by-side comparison with original results
   - Statistical comparison (confidence intervals overlap?)
   - What reproduced and what didn't?

5. Analysis
   - If results differ: why? What's sensitive?
   - Hidden hyperparameters or implementation details?
   - Robustness to seed, hardware, library versions?

6. Recommendations
   - For original authors: what should be clarified?
   - For practitioners: what to watch out for?
   - For the field: what reproducibility lessons emerge?
```
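The "confidence intervals overlap?" check in the Results step can be scripted from means and standard deviations over independent seeds. A normal-approximation sketch (all numbers in the example are hypothetical; note that CI overlap is a conservative heuristic, not a substitute for a direct hypothesis test):

```python
import math

def ci95(mean, std, n):
    """Normal-approximation 95% CI for a mean over n independent runs."""
    half = 1.96 * std / math.sqrt(n)
    return (mean - half, mean + half)

def cis_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals intersect."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical: original reports 74.2 ± 0.8 over 5 seeds,
# your reproduction gets 72.9 ± 1.1 over 5 seeds
original = ci95(74.2, 0.8, 5)
repro = ci95(72.9, 1.1, 5)
# Overlapping CIs suggest (but do not prove) the result reproduced.
```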
### Venues

| Venue | Notes |
|-------|-------|
| **ML Reproducibility Challenge** | Annual challenge at NeurIPS |
| **ReScience** | Journal dedicated to replications |
| **TMLR** | Accepts reproductions with analysis |
| **Workshops** | Reproducibility workshops at major conferences |
---

*Additions to the skill's sources document (diff context: "This document lists all authoritative sources used to build this skill, organize…"):*
### For Reviewer Expectations
→ Start with: Venue reviewer guidelines, reviewer-guidelines.md

### For Human Evaluation
→ Start with: human-evaluation.md, Prolific/MTurk documentation

### For Non-Empirical Papers (Theory, Survey, Benchmark, Position)
→ Start with: paper-types.md

---
## Human Evaluation & Annotation

| Source | URL | Key Contribution |
|--------|-----|------------------|
| **Datasheets for Datasets** | Gebru et al., 2021 ([arXiv](https://arxiv.org/abs/1803.09010)) | Structured dataset documentation framework |
| **Model Cards for Model Reporting** | Mitchell et al., 2019 ([arXiv](https://arxiv.org/abs/1810.03993)) | Structured model documentation framework |
| **Crowdsourcing and Human Computation** | [Survey](https://arxiv.org/abs/2202.06516) | Best practices for crowdsourced annotation |
| **Krippendorff's Alpha** | [Wikipedia](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) | Inter-annotator agreement metric reference |
| **Prolific** | [prolific.co](https://www.prolific.co/) | Recommended crowdsourcing platform for research |

## Ethics & Broader Impact

| Source | URL | Key Contribution |
|--------|-----|------------------|
| **ML CO2 Impact** | [mlco2.github.io](https://mlco2.github.io/impact/) | Compute carbon footprint calculator |
| **NeurIPS Broader Impact Guide** | [NeurIPS](https://neurips.cc/public/guides/PaperChecklist) | Official guidance on impact statements |
| **ACL Ethics Policy** | [ACL](https://www.aclweb.org/portal/content/acl-code-ethics) | Ethics requirements for NLP research |