diff --git a/skills/research/ml-paper-writing/SKILL.md b/skills/research/ml-paper-writing/SKILL.md deleted file mode 100644 index 8650ef876..000000000 --- a/skills/research/ml-paper-writing/SKILL.md +++ /dev/null @@ -1,940 +0,0 @@ ---- -name: ml-paper-writing -description: Write publication-ready ML/AI papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Use when drafting papers from research repos, structuring arguments, verifying citations, or preparing camera-ready submissions. Includes LaTeX templates, reviewer guidelines, and citation verification workflows. -version: 1.0.0 -author: Orchestra Research -license: MIT -dependencies: [semanticscholar, arxiv, habanero, requests] -metadata: - hermes: - tags: [Academic Writing, NeurIPS, ICML, ICLR, ACL, AAAI, COLM, LaTeX, Paper Writing, Citations, Research] - ---- - -# ML Paper Writing for Top AI Conferences - -Expert-level guidance for writing publication-ready papers targeting **NeurIPS, ICML, ICLR, ACL, AAAI, and COLM**. This skill combines writing philosophy from top researchers (Nanda, Farquhar, Karpathy, Lipton, Steinhardt) with practical tools: LaTeX templates, citation verification APIs, and conference checklists. - -## Core Philosophy: Collaborative Writing - -**Paper writing is collaborative, but Claude should be proactive in delivering drafts.** - -The typical workflow starts with a research repository containing code, results, and experimental artifacts. Claude's role is to: - -1. **Understand the project** by exploring the repo, results, and existing documentation -2. **Deliver a complete first draft** when confident about the contribution -3. **Search literature** using web search and APIs to find relevant citations -4. **Refine through feedback cycles** when the scientist provides input -5. **Ask for clarification** only when genuinely uncertain about key decisions - -**Key Principle**: Be proactive. If the repo and results are clear, deliver a full draft. 
Don't block waiting for feedback on every section—scientists are busy. Produce something concrete they can react to, then iterate based on their response. - ---- - -## ⚠️ CRITICAL: Never Hallucinate Citations - -**This is the most important rule in academic writing with AI assistance.** - -### The Problem -AI-generated citations have a **~40% error rate**. Hallucinated references—papers that don't exist, wrong authors, incorrect years, fabricated DOIs—are a serious form of academic misconduct that can result in desk rejection or retraction. - -### The Rule -**NEVER generate BibTeX entries from memory. ALWAYS fetch programmatically.** - -| Action | ✅ Correct | ❌ Wrong | -|--------|-----------|----------| -| Adding a citation | Search API → verify → fetch BibTeX | Write BibTeX from memory | -| Uncertain about a paper | Mark as `[CITATION NEEDED]` | Guess the reference | -| Can't find exact paper | Note: "placeholder - verify" | Invent similar-sounding paper | - -### When You Can't Verify a Citation - -If you cannot programmatically verify a citation, you MUST: - -```latex -% EXPLICIT PLACEHOLDER - requires human verification -\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists -``` - -**Always tell the scientist**: "I've marked [X] citations as placeholders that need verification. I could not confirm these papers exist." 
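To make that hand-off concrete, a small helper could scan a draft for placeholder keys before you report them to the scientist. This is an illustrative sketch only (the function name is invented here; the `PLACEHOLDER_` prefix follows the convention shown above), not part of any required tooling:

```python
import re
from pathlib import Path

def find_placeholder_citations(tex_path: str) -> list[str]:
    """Collect every \\cite/\\citet/\\citep key using the PLACEHOLDER_ convention,
    so the scientist gets one consolidated list to verify."""
    text = Path(tex_path).read_text(encoding="utf-8")
    keys = []
    for match in re.finditer(r"\\cite[tp]?\{([^}]*)\}", text):
        for key in match.group(1).split(","):
            key = key.strip()
            if key.startswith("PLACEHOLDER_"):
                keys.append(key)
    return sorted(set(keys))
```

Run it as a final pre-submission pass and paste the resulting list directly into the note to the scientist.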
- -### Recommended: Install Exa MCP for Paper Search - -For the best paper search experience, install **Exa MCP** which provides real-time academic search: - -**Claude Code:** -```bash -claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp" -``` - -**Cursor / VS Code** (add to MCP settings): -```json -{ - "mcpServers": { - "exa": { - "type": "http", - "url": "https://mcp.exa.ai/mcp" - } - } -} -``` - -Exa MCP enables searches like: -- "Find papers on RLHF for language models published after 2023" -- "Search for transformer architecture papers by Vaswani" -- "Get recent work on sparse autoencoders for interpretability" - -Then verify results with Semantic Scholar API and fetch BibTeX via DOI. - ---- - -## Workflow 0: Starting from a Research Repository - -When beginning paper writing, start by understanding the project: - -``` -Project Understanding: -- [ ] Step 1: Explore the repository structure -- [ ] Step 2: Read README, existing docs, and key results -- [ ] Step 3: Identify the main contribution with the scientist -- [ ] Step 4: Find papers already cited in the codebase -- [ ] Step 5: Search for additional relevant literature -- [ ] Step 6: Outline the paper structure together -- [ ] Step 7: Draft sections iteratively with feedback -``` - -**Step 1: Explore the Repository** - -```bash -# Understand project structure -ls -la -find . -name "*.py" | head -20 -find . -name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding" -``` - -Look for: -- `README.md` - Project overview and claims -- `results/`, `outputs/`, `experiments/` - Key findings -- `configs/` - Experimental settings -- Existing `.bib` files or citation references -- Any draft documents or notes - -**Step 2: Identify Existing Citations** - -Check for papers already referenced in the codebase: - -```bash -# Find existing citations -grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py" -find . 
-name "*.bib" -``` - -These are high-signal starting points for Related Work—the scientist has already deemed them relevant. - -**Step 3: Clarify the Contribution** - -Before writing, explicitly confirm with the scientist: - -> "Based on my understanding of the repo, the main contribution appears to be [X]. -> The key results show [Y]. Is this the framing you want for the paper, -> or should we emphasize different aspects?" - -**Never assume the narrative—always verify with the human.** - -**Step 4: Search for Additional Literature** - -Use web search to find relevant papers: - -``` -Search queries to try: -- "[main technique] + [application domain]" -- "[baseline method] comparison" -- "[problem name] state-of-the-art" -- Author names from existing citations -``` - -Then verify and retrieve BibTeX using the citation workflow below. - -**Step 5: Deliver a First Draft** - -**Be proactive—deliver a complete draft rather than asking permission for each section.** - -If the repo provides clear results and the contribution is apparent: -1. Write the full first draft end-to-end -2. Present the complete draft for feedback -3. Iterate based on scientist's response - -If genuinely uncertain about framing or major claims: -1. Draft what you can confidently -2. Flag specific uncertainties: "I framed X as the main contribution—let me know if you'd prefer to emphasize Y instead" -3. 
Continue with the draft rather than blocking - -**Questions to include with the draft** (not before): -- "I emphasized X as the main contribution—adjust if needed" -- "I highlighted results A, B, C—let me know if others are more important" -- "Related work section includes [papers]—add any I missed" - ---- - -## When to Use This Skill - -Use this skill when: -- **Starting from a research repo** to write a paper -- **Drafting or revising** specific sections -- **Finding and verifying citations** for related work -- **Formatting** for conference submission -- **Resubmitting** to a different venue (format conversion) -- **Iterating** on drafts with scientist feedback - -**Always remember**: First drafts are starting points for discussion, not final outputs. - ---- - -## Balancing Proactivity and Collaboration - -**Default: Be proactive. Deliver drafts, then iterate.** - -| Confidence Level | Action | -|-----------------|--------| -| **High** (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback | -| **Medium** (some ambiguity) | Write draft with flagged uncertainties, continue | -| **Low** (major unknowns) | Ask 1-2 targeted questions, then draft | - -**Draft first, ask with the draft** (not before): - -| Section | Draft Autonomously | Flag With Draft | -|---------|-------------------|-----------------| -| Abstract | Yes | "Framed contribution as X—adjust if needed" | -| Introduction | Yes | "Emphasized problem Y—correct if wrong" | -| Methods | Yes | "Included details A, B, C—add missing pieces" | -| Experiments | Yes | "Highlighted results 1, 2, 3—reorder if needed" | -| Related Work | Yes | "Cited papers X, Y, Z—add any I missed" | - -**Only block for input when:** -- Target venue is unclear (affects page limits, framing) -- Multiple contradictory framings seem equally valid -- Results seem incomplete or inconsistent -- Explicit request to review before continuing - -**Don't block for:** -- Word choice decisions -- Section ordering -- 
Which specific results to show (make a choice, flag it) -- Citation completeness (draft with what you find, note gaps) - ---- - -## The Narrative Principle - -**The single most critical insight**: Your paper is not a collection of experiments—it's a story with one clear contribution supported by evidence. - -Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about. - -**Three Pillars (must be crystal clear by end of introduction):** - -| Pillar | Description | Example | -|--------|-------------|---------| -| **The What** | 1-3 specific novel claims within cohesive theme | "We prove that X achieves Y under condition Z" | -| **The Why** | Rigorous empirical evidence supporting claims | Strong baselines, experiments distinguishing hypotheses | -| **The So What** | Why readers should care | Connection to recognized community problems | - -**If you cannot state your contribution in one sentence, you don't yet have a paper.** - ---- - -## Paper Structure Workflow - -### Workflow 1: Writing a Complete Paper (Iterative) - -Copy this checklist and track progress. 
**Each step involves drafting → feedback → revision:** - -``` -Paper Writing Progress: -- [ ] Step 1: Define the one-sentence contribution (with scientist) -- [ ] Step 2: Draft Figure 1 → get feedback → revise -- [ ] Step 3: Draft abstract → get feedback → revise -- [ ] Step 4: Draft introduction → get feedback → revise -- [ ] Step 5: Draft methods → get feedback → revise -- [ ] Step 6: Draft experiments → get feedback → revise -- [ ] Step 7: Draft related work → get feedback → revise -- [ ] Step 8: Draft limitations → get feedback → revise -- [ ] Step 9: Complete paper checklist (required) -- [ ] Step 10: Final review cycle and submission -``` - -**Step 1: Define the One-Sentence Contribution** - -**This step requires explicit confirmation from the scientist.** - -Before writing anything, articulate and verify: -- What is the single thing your paper contributes? -- What was not obvious or present before your work? - -> "I propose framing the contribution as: '[one sentence]'. Does this capture -> what you see as the main takeaway? Should we adjust the emphasis?" - -**Step 2: Draft Figure 1** - -Figure 1 deserves special attention—many readers skip directly to it. -- Convey core idea, approach, or most compelling result -- Use vector graphics (PDF/EPS for plots) -- Write captions that stand alone without main text -- Ensure readability in black-and-white (8% of men have color vision deficiency) - -**Step 3: Write Abstract (5-Sentence Formula)** - -From Sebastian Farquhar (DeepMind): - -``` -1. What you achieved: "We introduce...", "We prove...", "We demonstrate..." -2. Why this is hard and important -3. How you do it (with specialist keywords for discoverability) -4. What evidence you have -5. Your most remarkable number/result -``` - -**Delete** generic openings like "Large language models have achieved remarkable success..." 
- -**Step 4: Write Introduction (1-1.5 pages max)** - -Must include: -- 2-4 bullet contribution list (max 1-2 lines each in two-column format) -- Clear problem statement -- Brief approach overview -- Methods should start by page 2-3 maximum - -**Step 5: Methods Section** - -Enable reimplementation: -- Conceptual outline or pseudocode -- All hyperparameters listed -- Architectural details sufficient for reproduction -- Present final design decisions; ablations go in experiments - -**Step 6: Experiments Section** - -For each experiment, explicitly state: -- What claim it supports -- How it connects to main contribution -- Experimental setting (details in appendix) -- What to observe: "the blue line shows X, which demonstrates Y" - -Requirements: -- Error bars with methodology (standard deviation vs standard error) -- Hyperparameter search ranges -- Compute infrastructure (GPU type, total hours) -- Seed-setting methods - -**Step 7: Related Work** - -Organize methodologically, not paper-by-paper: - -**Good:** "One line of work uses Floogledoodle's assumption [refs] whereas we use Doobersnoddle's assumption because..." - -**Bad:** "Snap et al. introduced X while Crackle et al. introduced Y." - -Cite generously—reviewers likely authored relevant papers. - -**Step 8: Limitations Section (REQUIRED)** - -All major conferences require this. Counter-intuitively, honesty helps: -- Reviewers are instructed not to penalize honest limitation acknowledgment -- Pre-empt criticisms by identifying weaknesses first -- Explain why limitations don't undermine core claims - -**Step 9: Paper Checklist** - -NeurIPS, ICML, and ICLR all require paper checklists. See [references/checklists.md](references/checklists.md). - ---- - -## Writing Philosophy for Top ML Conferences - -**This section distills the most important writing principles from leading ML researchers.** These aren't optional style suggestions—they're what separates accepted papers from rejected ones. 
- -> "A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about." — Neel Nanda - -### The Sources Behind This Guidance - -This skill synthesizes writing philosophy from researchers who have published extensively at top venues: - -| Source | Key Contribution | Link | -|--------|-----------------|------| -| **Neel Nanda** (Google DeepMind) | The Narrative Principle, What/Why/So What framework | [How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) | -| **Sebastian Farquhar** (DeepMind) | 5-sentence abstract formula | [How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) | -| **Gopen & Swan** | 7 principles of reader expectations | [Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) | -| **Zachary Lipton** | Word choice, eliminating hedging | [Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) | -| **Jacob Steinhardt** (UC Berkeley) | Precision, consistent terminology | [Writing Tips](https://bounded-regret.ghost.io/) | -| **Ethan Perez** (Anthropic) | Micro-level clarity tips | [Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) | -| **Andrej Karpathy** | Single contribution focus | Various lectures | - -**For deeper dives into any of these, see:** -- [references/writing-guide.md](references/writing-guide.md) - Full explanations with examples -- [references/sources.md](references/sources.md) - Complete bibliography - -### Time Allocation (From Neel Nanda) - -Spend approximately **equal time** on each of: -1. The abstract -2. The introduction -3. The figures -4. Everything else combined - -**Why?** Most reviewers form judgments before reaching your methods. 
Readers encounter your paper as: **title → abstract → introduction → figures → maybe the rest.** - -### Writing Style Guidelines - -#### Sentence-Level Clarity (Gopen & Swan's 7 Principles) - -These principles are based on how readers actually process prose. Violating them forces readers to spend cognitive effort on structure rather than content. - -| Principle | Rule | Example | -|-----------|------|---------| -| **Subject-verb proximity** | Keep subject and verb close | ❌ "The model, which was trained on..., achieves" → ✅ "The model achieves... after training on..." | -| **Stress position** | Place emphasis at sentence ends | ❌ "Accuracy improves by 15% when using attention" → ✅ "When using attention, accuracy improves by **15%**" | -| **Topic position** | Put context first, new info after | ✅ "Given these constraints, we propose..." | -| **Old before new** | Familiar info → unfamiliar info | Link backward, then introduce new | -| **One unit, one function** | Each paragraph makes one point | Split multi-point paragraphs | -| **Action in verb** | Use verbs, not nominalizations | ❌ "We performed an analysis" → ✅ "We analyzed" | -| **Context before new** | Set stage before presenting | Explain before showing equation | - -**Full 7 principles with detailed examples:** See [references/writing-guide.md](references/writing-guide.md#the-7-principles-of-reader-expectations) - -#### Micro-Level Tips (Ethan Perez) - -These small changes accumulate into significantly clearer prose: - -- **Minimize pronouns**: ❌ "This shows..." → ✅ "This result shows..." 
-- **Verbs early**: Position verbs near sentence start -- **Unfold apostrophes**: ❌ "X's Y" → ✅ "The Y of X" (when awkward) -- **Delete filler words**: "actually," "a bit," "very," "really," "basically," "quite," "essentially" - -**Full micro-tips with examples:** See [references/writing-guide.md](references/writing-guide.md#micro-level-writing-tips) - -#### Word Choice (Zachary Lipton) - -- **Be specific**: ❌ "performance" → ✅ "accuracy" or "latency" (say what you mean) -- **Eliminate hedging**: Drop "may" and "can" unless genuinely uncertain -- **Avoid incremental vocabulary**: ❌ "combine," "modify," "expand" → ✅ "develop," "propose," "introduce" -- **Delete intensifiers**: ❌ "provides *very* tight approximation" → ✅ "provides tight approximation" - -#### Precision Over Brevity (Jacob Steinhardt) - -- **Consistent terminology**: Different terms for same concept creates confusion. Pick one and stick with it. -- **State assumptions formally**: Before theorems, list all assumptions explicitly -- **Intuition + rigor**: Provide intuitive explanations alongside formal proofs - -### What Reviewers Actually Read - -Understanding reviewer behavior helps prioritize your effort: - -| Paper Section | % Reviewers Who Read | Implication | -|---------------|---------------------|-------------| -| Abstract | 100% | Must be perfect | -| Introduction | 90%+ (skimmed) | Front-load contribution | -| Figures | Examined before methods | Figure 1 is critical | -| Methods | Only if interested | Don't bury the lede | -| Appendix | Rarely | Put only supplementary details | - -**Bottom line**: If your abstract and intro don't hook reviewers, they may never read your brilliant methods section. 
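The micro-level word-choice rules above (Perez's filler list, Lipton's intensifiers) are mechanical enough to lint automatically. A minimal sketch follows; the word list and function name are assumptions for illustration, and a final pass by eye is still needed:

```python
import re

# Filler and hedging words flagged in the tips above; extend as needed.
FILLERS = {"actually", "a bit", "very", "really", "basically", "quite", "essentially"}

def flag_fillers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, word) pairs for filler words, ignoring case."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in FILLERS:
            if re.search(rf"\b{re.escape(word)}\b", line, flags=re.IGNORECASE):
                hits.append((lineno, word))
    return hits
```

Each hit is a candidate deletion, not an automatic one: "very" in a quoted baseline name, for example, should stay.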
- ---- - -## Conference Requirements Quick Reference - -| Conference | Page Limit | Extra for Camera-Ready | Key Requirement | -|------------|------------|------------------------|-----------------| -| **NeurIPS 2025** | 9 pages | +0 | Mandatory checklist, lay summary for accepted | -| **ICML 2026** | 8 pages | +1 | Broader Impact Statement required | -| **ICLR 2026** | 9 pages | +1 | LLM disclosure required, reciprocal reviewing | -| **ACL 2025** | 8 pages (long) | varies | Limitations section mandatory | -| **AAAI 2026** | 7 pages | +1 | Strict style file adherence | -| **COLM 2025** | 9 pages | +1 | Focus on language models | - -**Universal Requirements:** -- Double-blind review (anonymize submissions) -- References don't count toward page limit -- Appendices unlimited but reviewers not required to read -- LaTeX required for all venues - -**LaTeX Templates:** See [templates/](templates/) directory for all conference templates. - ---- - -## Using LaTeX Templates Properly - -### Workflow 4: Starting a New Paper from Template - -**Always copy the entire template directory first, then write within it.** - -``` -Template Setup Checklist: -- [ ] Step 1: Copy entire template directory to new project -- [ ] Step 2: Verify template compiles as-is (before any changes) -- [ ] Step 3: Read the template's example content to understand structure -- [ ] Step 4: Replace example content section by section -- [ ] Step 5: Keep template comments/examples as reference until done -- [ ] Step 6: Clean up template artifacts only at the end -``` - -**Step 1: Copy the Full Template** - -```bash -# Create your paper directory with the complete template -cp -r templates/neurips2025/ ~/papers/my-new-paper/ -cd ~/papers/my-new-paper/ - -# Verify structure is complete -ls -la -# Should see: main.tex, neurips.sty, Makefile, etc. -``` - -**⚠️ IMPORTANT**: Copy the ENTIRE directory, not just `main.tex`. 
Templates include: -- Style files (`.sty`) - required for compilation -- Bibliography styles (`.bst`) - required for references -- Example content - useful as reference -- Makefiles - for easy compilation - -**Step 2: Verify Template Compiles First** - -Before making ANY changes, compile the template as-is: - -```bash -# Using latexmk (recommended) -latexmk -pdf main.tex - -# Or manual compilation -pdflatex main.tex -bibtex main -pdflatex main.tex -pdflatex main.tex -``` - -If the unmodified template doesn't compile, fix that first. Common issues: -- Missing TeX packages → install via `tlmgr install ` -- Wrong TeX distribution → use TeX Live (recommended) - -**Step 3: Keep Template Content as Reference** - -Don't immediately delete all example content. Instead: - -```latex -% KEEP template examples commented out as you write -% This shows you the expected format - -% Template example (keep for reference): -% \begin{figure}[t] -% \centering -% \includegraphics[width=0.8\linewidth]{example-image} -% \caption{Template shows caption style} -% \end{figure} - -% Your actual figure: -\begin{figure}[t] - \centering - \includegraphics[width=0.8\linewidth]{your-figure.pdf} - \caption{Your caption following the same style.} -\end{figure} -``` - -**Step 4: Replace Content Section by Section** - -Work through the paper systematically: - -``` -Replacement Order: -1. Title and authors (anonymize for submission) -2. Abstract -3. Introduction -4. Methods -5. Experiments -6. Related Work -7. Conclusion -8. References (your .bib file) -9. Appendix -``` - -For each section: -1. Read the template's example content -2. Note any special formatting or macros used -3. Replace with your content following the same patterns -4. Compile frequently to catch errors early - -**Step 5: Use Template Macros** - -Templates often define useful macros. 
Check the preamble for: - -```latex -% Common template macros to use: -\newcommand{\method}{YourMethodName} % Consistent method naming -\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations -\newcommand{\ie}{i.e.,\xspace} -\newcommand{\etal}{\textit{et al.}\xspace} -``` - -**Step 6: Clean Up Only at the End** - -Only remove template artifacts when paper is nearly complete: - -```latex -% BEFORE SUBMISSION - remove these: -% - Commented-out template examples -% - Unused packages -% - Template's example figures/tables -% - Lorem ipsum or placeholder text - -% KEEP these: -% - All style files (.sty) -% - Bibliography style (.bst) -% - Required packages from template -% - Any custom macros you're using -``` - -### Template Pitfalls to Avoid - -| Pitfall | Problem | Solution | -|---------|---------|----------| -| Copying only `main.tex` | Missing `.sty`, won't compile | Copy entire directory | -| Modifying `.sty` files | Breaks conference formatting | Never edit style files | -| Adding random packages | Conflicts, breaks template | Only add if necessary | -| Deleting template content too early | Lose formatting reference | Keep as comments until done | -| Not compiling frequently | Errors accumulate | Compile after each section | - -### Quick Template Reference - -| Conference | Main File | Key Style File | Notes | -|------------|-----------|----------------|-------| -| NeurIPS 2025 | `main.tex` | `neurips.sty` | Has Makefile | -| ICML 2026 | `example_paper.tex` | `icml2026.sty` | Includes algorithm packages | -| ICLR 2026 | `iclr2026_conference.tex` | `iclr2026_conference.sty` | Has math_commands.tex | -| ACL | `acl_latex.tex` | `acl.sty` | Strict formatting | -| AAAI 2026 | `aaai2026-unified-template.tex` | `aaai2026.sty` | Very strict compliance | -| COLM 2025 | `colm2025_conference.tex` | `colm2025_conference.sty` | Similar to ICLR | - ---- - -## Conference Resubmission & Format Conversion - -When a paper is rejected or withdrawn from one venue and resubmitted to 
another, format conversion is required. This is a common workflow in ML research. - -### Workflow 3: Converting Between Conference Formats - -``` -Format Conversion Checklist: -- [ ] Step 1: Identify source and target template differences -- [ ] Step 2: Create new project with target template -- [ ] Step 3: Copy content sections (not preamble) -- [ ] Step 4: Adjust page limits and content -- [ ] Step 5: Update conference-specific requirements -- [ ] Step 6: Verify compilation and formatting -``` - -**Step 1: Key Template Differences** - -| From → To | Page Change | Key Adjustments | -|-----------|-------------|-----------------| -| NeurIPS → ICML | 9 → 8 pages | Cut 1 page, add Broader Impact if missing | -| ICML → ICLR | 8 → 9 pages | Can expand experiments, add LLM disclosure | -| NeurIPS → ACL | 9 → 8 pages | Restructure for NLP conventions, add Limitations | -| ICLR → AAAI | 9 → 7 pages | Significant cuts needed, strict style adherence | -| Any → COLM | varies → 9 | Reframe for language model focus | - -**Step 2: Content Migration (NOT Template Merge)** - -**Never copy LaTeX preambles between templates.** Instead: - -```bash -# 1. Start fresh with target template -cp -r templates/icml2026/ new_submission/ - -# 2. Copy ONLY content sections from old paper -# - Abstract text -# - Section content (between \section{} commands) -# - Figures and tables -# - Bibliography entries - -# 3. 
Paste into target template structure -``` - -**Step 3: Adjusting for Page Limits** - -When cutting pages (e.g., NeurIPS 9 → AAAI 7): -- Move detailed proofs to appendix -- Condense related work (cite surveys instead of individual papers) -- Combine similar experiments into unified tables -- Use smaller figure sizes with subfigures -- Tighten writing: eliminate redundancy, use active voice - -When expanding (e.g., ICML 8 → ICLR 9): -- Add ablation studies reviewers requested -- Expand limitations discussion -- Include additional baselines -- Add qualitative examples - -**Step 4: Conference-Specific Adjustments** - -| Target Venue | Required Additions | -|--------------|-------------------| -| **ICML** | Broader Impact Statement (after conclusion) | -| **ICLR** | LLM usage disclosure, reciprocal reviewing agreement | -| **ACL/EMNLP** | Limitations section (mandatory), Ethics Statement | -| **AAAI** | Strict adherence to style file (no modifications) | -| **NeurIPS** | Paper checklist (appendix), lay summary if accepted | - -**Step 5: Update References** - -```latex -% Remove self-citations that reveal identity (for blind review) -% Update any "under review" citations to published versions -% Add new relevant work published since last submission -``` - -**Step 6: Addressing Previous Reviews** - -When resubmitting after rejection: -- **Do** address reviewer concerns in the new version -- **Do** add experiments/clarifications reviewers requested -- **Don't** include a "changes from previous submission" section (blind review) -- **Don't** reference the previous submission or reviews - -**Common Conversion Pitfalls:** -- ❌ Copying `\usepackage` commands (causes conflicts) -- ❌ Keeping old conference header/footer commands -- ❌ Forgetting to update `\bibliography{}` path -- ❌ Missing conference-specific required sections -- ❌ Exceeding page limit after format change - ---- - -## Citation Workflow (Hallucination Prevention) - -**⚠️ CRITICAL**: AI-generated citations have 
~40% error rate. **Never write BibTeX from memory.** - -### The Golden Rule - -``` -IF you cannot programmatically fetch a citation: - → Mark it as [CITATION NEEDED] or [PLACEHOLDER - VERIFY] - → Tell the scientist explicitly - → NEVER invent a plausible-sounding reference -``` - -### Workflow 2: Adding Citations - -``` -Citation Verification (MANDATORY for every citation): -- [ ] Step 1: Search using Exa MCP or Semantic Scholar API -- [ ] Step 2: Verify paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef) -- [ ] Step 3: Retrieve BibTeX via DOI (programmatically, not from memory) -- [ ] Step 4: Verify the claim you're citing actually appears in the paper -- [ ] Step 5: Add verified BibTeX to bibliography -- [ ] Step 6: If ANY step fails → mark as placeholder, inform scientist -``` - -**Step 0: Use Exa MCP for Initial Search (Recommended)** - -If Exa MCP is installed, use it to find relevant papers: -``` -Search: "RLHF language model alignment 2023" -Search: "sparse autoencoders interpretability" -Search: "attention mechanism transformers Vaswani" -``` - -Then verify each result with Semantic Scholar and fetch BibTeX via DOI. - -**Step 1: Search Semantic Scholar** - -```python -from semanticscholar import SemanticScholar - -sch = SemanticScholar() -results = sch.search_paper("attention mechanism transformers", limit=5) -for paper in results: - print(f"{paper.title} - {paper.paperId}") - print(f" DOI: {paper.externalIds.get('DOI', 'N/A')}") -``` - -**Step 2: Verify Existence** - -Confirm paper appears in at least two sources (Semantic Scholar + CrossRef/arXiv). 
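One way to perform the arXiv half of this two-source check is the public arXiv Atom API. The sketch below assumes a title-based query (the `ti:` field prefix) and uses only the Python standard library; the helper names are illustrative, and the feed-parsing step is deliberately split out so it can be checked without network access:

```python
import re
import urllib.parse
import urllib.request

ARXIV_API = "http://export.arxiv.org/api/query"

def fetch_arxiv_feed(title: str, max_results: int = 5) -> str:
    """Query the arXiv Atom API by title and return the raw XML feed."""
    query = urllib.parse.urlencode({
        "search_query": f'ti:"{title}"',
        "max_results": max_results,
    })
    with urllib.request.urlopen(f"{ARXIV_API}?{query}") as resp:
        return resp.read().decode("utf-8")

def title_in_feed(title: str, feed_xml: str) -> bool:
    """True if some <title> entry in the feed matches, ignoring case/whitespace."""
    normalize = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    titles = re.findall(r"<title[^>]*>(.*?)</title>", feed_xml, flags=re.DOTALL)
    return any(normalize(title) == normalize(t) for t in titles)
```

A paper counts as verified on this leg only when `title_in_feed` matches exactly; a near-miss title is precisely the failure mode (similar-sounding paper, different authors) that must be flagged to the scientist.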
- -**Step 3: Retrieve BibTeX via DOI** - -```python -import requests - -def doi_to_bibtex(doi: str) -> str: - """Get verified BibTeX from DOI via CrossRef.""" - response = requests.get( - f"https://doi.org/{doi}", - headers={"Accept": "application/x-bibtex"} - ) - response.raise_for_status() - return response.text - -# Example -bibtex = doi_to_bibtex("10.48550/arXiv.1706.03762") -print(bibtex) -``` - -**Step 4: Verify Claims** - -Before citing for a specific claim, access the paper and confirm the attributed claim actually appears. - -**Step 5: Handle Failures Explicitly** - -If you cannot verify a citation at ANY step: - -```latex -% Option 1: Explicit placeholder -\cite{PLACEHOLDER_smith2023_verify} % TODO: Could not verify - scientist must confirm - -% Option 2: Note in text -... as shown in prior work [CITATION NEEDED - could not verify Smith et al. 2023]. -``` - -**Always inform the scientist:** -> "I could not verify the following citations and have marked them as placeholders: -> - Smith et al. 2023 on reward hacking - could not find in Semantic Scholar -> - Jones 2022 on scaling laws - found similar paper but different authors -> Please verify these before submission." - -### Summary: Citation Rules - -| Situation | Action | -|-----------|--------| -| Found paper, got DOI, fetched BibTeX | ✅ Use the citation | -| Found paper, no DOI | ✅ Use arXiv BibTeX or manual entry from paper | -| Paper exists but can't fetch BibTeX | ⚠️ Mark placeholder, inform scientist | -| Uncertain if paper exists | ❌ Mark `[CITATION NEEDED]`, inform scientist | -| "I think there's a paper about X" | ❌ **NEVER cite** - search first or mark placeholder | - -**🚨 NEVER generate BibTeX from memory—always fetch programmatically. 🚨** - -See [references/citation-workflow.md](references/citation-workflow.md) for complete API documentation. - ---- - -## Common Issues and Solutions - -**Issue: Abstract too generic** - -Delete first sentence if it could be prepended to any ML paper. 
Start with your specific contribution. - -**Issue: Introduction exceeds 1.5 pages** - -Split background into Related Work. Front-load contribution bullets. Methods should start by page 2-3. - -**Issue: Experiments lack explicit claims** - -Add sentence before each experiment: "This experiment tests whether [specific claim]..." - -**Issue: Reviewers find paper hard to follow** - -- Add explicit signposting: "In this section, we show X" -- Use consistent terminology throughout -- Include figure captions that stand alone - -**Issue: Missing statistical significance** - -Always include: -- Error bars (specify: std dev or std error) -- Number of runs -- Statistical tests if comparing methods - ---- - -## Reviewer Evaluation Criteria - -Reviewers assess papers on four dimensions: - -| Criterion | What Reviewers Look For | -|-----------|------------------------| -| **Quality** | Technical soundness, well-supported claims | -| **Clarity** | Clear writing, reproducible by experts | -| **Significance** | Community impact, advances understanding | -| **Originality** | New insights (doesn't require new method) | - -**Scoring (NeurIPS 6-point scale):** -- 6: Strong Accept - Groundbreaking, flawless -- 5: Accept - Technically solid, high impact -- 4: Borderline Accept - Solid, limited evaluation -- 3: Borderline Reject - Solid but weaknesses outweigh -- 2: Reject - Technical flaws -- 1: Strong Reject - Known results or ethics issues - -See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for detailed reviewer instructions. 
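For the error-bar issue above, a minimal sketch of computing both spread measures across seeds (the function name is illustrative; std dev describes run-to-run spread, std err the uncertainty of the mean, and the paper must say which one the bars show):

```python
import statistics
from math import sqrt

def summarize_runs(scores: list[float]) -> dict:
    """Summarize one method's scores across n seeded runs."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std_dev = statistics.stdev(scores)   # sample std dev: spread across seeds
    std_err = std_dev / sqrt(n)          # standard error of the mean
    return {"n": n, "mean": mean, "std_dev": std_dev, "std_err": std_err}
```

Report `n` alongside the bars; three seeds with overlapping standard errors is a different claim than twenty.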

---

## Tables and Figures

### Tables

Use the `booktabs` LaTeX package for professional tables:

```latex
\usepackage{booktabs}
\begin{tabular}{lrr}
\toprule
Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\
\midrule
Baseline & 85.2 & 45ms \\
\textbf{Ours} & \textbf{92.1} & 38ms \\
\bottomrule
\end{tabular}
```

**Rules:**
- Bold best value per metric
- Include direction symbols (`$\uparrow$` higher is better, `$\downarrow$` lower is better)
- Right-align numerical columns (`r` column type, as above)
- Consistent decimal precision

### Figures

- **Vector graphics** (PDF, EPS) for all plots and diagrams
- **Raster** (PNG 600 DPI) only for photographs
- Use **colorblind-safe palettes** (Okabe-Ito or Paul Tol)
- Verify **grayscale readability** (8% of men have color vision deficiency)
- **No title inside figure**—the caption serves this function
- **Self-contained captions**—reader should understand without main text

---

## References & Resources

### Reference Documents (Deep Dives)

| Document | Contents |
|----------|----------|
| [writing-guide.md](references/writing-guide.md) | Gopen & Swan 7 principles, Ethan Perez micro-tips, word choice |
| [citation-workflow.md](references/citation-workflow.md) | Citation APIs, Python code, BibTeX management |
| [checklists.md](references/checklists.md) | NeurIPS 16-item, ICML, ICLR, ACL requirements |
| [reviewer-guidelines.md](references/reviewer-guidelines.md) | Evaluation criteria, scoring, rebuttals |
| [sources.md](references/sources.md) | Complete bibliography of all sources |

### LaTeX Templates

Templates in `templates/` directory: **ICML 2026**, **ICLR 2026**, **NeurIPS 2025**, **ACL/EMNLP**, **AAAI 2026**, **COLM 2025**.
- -**Compiling to PDF:** -- **VS Code/Cursor**: Install LaTeX Workshop extension + TeX Live → Save to auto-compile -- **Command line**: `latexmk -pdf main.tex` or `pdflatex` + `bibtex` workflow -- **Online**: Upload to [Overleaf](https://overleaf.com) - -See [templates/README.md](templates/README.md) for detailed setup instructions. - -### Key External Sources - -**Writing Philosophy:** -- [Neel Nanda: How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) - Narrative, "What/Why/So What" -- [Farquhar: How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) - 5-sentence abstract -- [Gopen & Swan: Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) - 7 reader expectation principles -- [Lipton: Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) - Word choice -- [Perez: Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) - Micro-level clarity - -**APIs:** [Semantic Scholar](https://api.semanticscholar.org/api-docs/) | [CrossRef](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | [arXiv](https://info.arxiv.org/help/api/basics.html) - -**Venues:** [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | [ICML](https://icml.cc/Conferences/2025/AuthorInstructions) | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | [ACL](https://github.com/acl-org/acl-style-files) - diff --git a/skills/research/research-paper-writing/SKILL.md b/skills/research/research-paper-writing/SKILL.md new file mode 100644 index 000000000..16dcb8ac2 --- /dev/null +++ b/skills/research/research-paper-writing/SKILL.md @@ -0,0 +1,1599 @@ +--- +name: research-paper-writing +title: Research Paper Writing Pipeline +description: End-to-end pipeline for writing ML/AI research 
papers — from experiment design through analysis, drafting, revision, and submission. Covers NeurIPS, ICML, ICLR, ACL, AAAI, COLM. Integrates automated experiment monitoring, statistical analysis, iterative writing, and citation verification. +version: 1.0.0 +author: Orchestra Research +license: MIT +dependencies: [semanticscholar, arxiv, habanero, requests, scipy, numpy, matplotlib, SciencePlots] +platforms: [linux, macos] +metadata: + hermes: + tags: [Research, Paper Writing, Experiments, ML, AI, NeurIPS, ICML, ICLR, ACL, AAAI, COLM, LaTeX, Citations, Statistical Analysis] + category: research + related_skills: [arxiv, ml-paper-writing, subagent-driven-development, plan] + requires_toolsets: [terminal, files] + +--- + +# Research Paper Writing Pipeline + +End-to-end pipeline for producing publication-ready ML/AI research papers targeting **NeurIPS, ICML, ICLR, ACL, AAAI, and COLM**. This skill covers the full research lifecycle: experiment design, execution, monitoring, analysis, paper writing, review, revision, and submission. + +This is **not a linear pipeline** — it is an iterative loop. Results trigger new experiments. Reviews trigger new analysis. The agent must handle these feedback loops. 
+ +``` +┌─────────────────────────────────────────────────────────────┐ +│ RESEARCH PAPER PIPELINE │ +│ │ +│ Phase 0: Project Setup ──► Phase 1: Literature Review │ +│ │ │ │ +│ ▼ ▼ │ +│ Phase 2: Experiment Phase 5: Paper Drafting ◄──┐ │ +│ Design │ │ │ +│ │ ▼ │ │ +│ ▼ Phase 6: Self-Review │ │ +│ Phase 3: Execution & & Revision ──────────┘ │ +│ Monitoring │ │ +│ │ ▼ │ +│ ▼ Phase 7: Submission │ +│ Phase 4: Analysis ─────► (feeds back to Phase 2 or 5) │ +│ │ +└─────────────────────────────────────────────────────────────┘ +``` + +--- + +## When To Use This Skill + +Use this skill when: +- **Starting a new research paper** from an existing codebase or idea +- **Designing and running experiments** to support paper claims +- **Writing or revising** any section of a research paper +- **Preparing for submission** to a specific conference +- **Responding to reviews** with additional experiments or revisions +- **Converting** a paper between conference formats + +## Core Philosophy + +1. **Be proactive.** Deliver complete drafts, not questions. Scientists are busy — produce something concrete they can react to, then iterate. +2. **Never hallucinate citations.** AI-generated citations have ~40% error rate. Always fetch programmatically. Mark unverifiable citations as `[CITATION NEEDED]`. +3. **Paper is a story, not a collection of experiments.** Every paper needs one clear contribution stated in a single sentence. If you can't do that, the paper isn't ready. +4. **Experiments serve claims.** Every experiment must explicitly state which claim it supports. Never run experiments that don't connect to the paper's narrative. +5. **Commit early, commit often.** Every completed experiment batch, every paper draft update — commit with descriptive messages. Git log is the experiment history. + +### Proactivity and Collaboration + +**Default: Be proactive. 
Draft first, ask with the draft.** + +| Confidence Level | Action | +|-----------------|--------| +| **High** (clear repo, obvious contribution) | Write full draft, deliver, iterate on feedback | +| **Medium** (some ambiguity) | Write draft with flagged uncertainties, continue | +| **Low** (major unknowns) | Ask 1-2 targeted questions via `clarify`, then draft | + +| Section | Draft Autonomously? | Flag With Draft | +|---------|-------------------|-----------------| +| Abstract | Yes | "Framed contribution as X — adjust if needed" | +| Introduction | Yes | "Emphasized problem Y — correct if wrong" | +| Methods | Yes | "Included details A, B, C — add missing pieces" | +| Experiments | Yes | "Highlighted results 1, 2, 3 — reorder if needed" | +| Related Work | Yes | "Cited papers X, Y, Z — add any I missed" | + +**Block for input only when**: target venue unclear, multiple contradictory framings, results seem incomplete, explicit request to review first. + +--- + +## Phase 0: Project Setup + +**Goal**: Establish the workspace, understand existing work, identify the contribution. + +### Step 0.1: Explore the Repository + +```bash +# Understand project structure +ls -la +find . -name "*.py" | head -30 +find . 
-name "*.md" -o -name "*.txt" | xargs grep -l -i "result\|conclusion\|finding"
```

Look for:
- `README.md` — project overview and claims
- `results/`, `outputs/`, `experiments/` — existing findings
- `configs/` — experimental settings
- `.bib` files — existing citations
- Draft documents or notes

### Step 0.2: Organize the Workspace

Establish a consistent workspace structure:

```
workspace/
  paper/        # LaTeX source, figures, compiled PDFs
  experiments/  # Experiment runner scripts
  code/         # Core method implementation
  results/      # Raw experiment results (auto-generated)
  tasks/        # Task/benchmark definitions
  human_eval/   # Human evaluation materials (if needed)
```

### Step 0.3: Set Up Version Control

```bash
git init                      # if not already a repo
git remote add origin <url>   # placeholder: your remote URL
git checkout -b paper-draft   # or main
```

**Git discipline**: Every completed experiment batch gets committed with a descriptive message. Example:
```
Add Monte Carlo constrained results (5 runs, Sonnet 4.6, policy memo task)
Add Haiku baseline comparison: autoreason vs refinement baselines at cheap model tier
```

### Step 0.4: Identify the Contribution

Before writing anything, articulate:
- **The What**: What is the single thing this paper contributes?
- **The Why**: What evidence supports it?
- **The So What**: Why should readers care?

> Propose to the scientist: "Based on my understanding, the main contribution is: [one sentence]. The key results show [Y]. Is this the framing you want?"

### Step 0.5: Create a TODO List

Use the `todo` tool to create a structured project plan:

```
Research Paper TODO:
- [ ] Define one-sentence contribution
- [ ] Literature review (related work + baselines)
- [ ] Design core experiments
- [ ] Run experiments
- [ ] Analyze results
- [ ] Write first draft
- [ ] Self-review (simulate reviewers)
- [ ] Revise based on review
- [ ] Submission prep
```

Update this throughout the project.
It serves as the persistent state across sessions. + +--- + +## Phase 1: Literature Review + +**Goal**: Find related work, identify baselines, gather citations. + +### Step 1.1: Identify Seed Papers + +Start from papers already referenced in the codebase: + +```bash +# Via terminal: +grep -r "arxiv\|doi\|cite" --include="*.md" --include="*.bib" --include="*.py" +find . -name "*.bib" +``` + +### Step 1.2: Search for Related Work + +**Load the `arxiv` skill** for structured paper discovery: `skill_view("arxiv")`. It provides arXiv REST API search, Semantic Scholar citation graphs, author profiles, and BibTeX generation. + +Use `web_search` for broad discovery, `web_extract` for fetching specific papers: + +``` +# Via web_search: +web_search("[main technique] + [application domain] site:arxiv.org") +web_search("[baseline method] comparison ICML NeurIPS 2024") + +# Via web_extract (for specific papers): +web_extract("https://arxiv.org/abs/2303.17651") +``` + +Additional search queries to try: + +``` +Search queries: +- "[main technique] + [application domain]" +- "[baseline method] comparison" +- "[problem name] state-of-the-art" +- Author names from existing citations +``` + +**Recommended**: Install **Exa MCP** for real-time academic search: +```bash +claude mcp add exa -- npx -y mcp-remote "https://mcp.exa.ai/mcp" +``` + +### Step 1.3: Verify Every Citation + +**NEVER generate BibTeX from memory. ALWAYS fetch programmatically.** + +For each citation, follow the mandatory 5-step process: + +``` +Citation Verification (MANDATORY per citation): +1. SEARCH → Query Semantic Scholar or Exa MCP with specific keywords +2. VERIFY → Confirm paper exists in 2+ sources (Semantic Scholar + arXiv/CrossRef) +3. RETRIEVE → Get BibTeX via DOI content negotiation (programmatically, not from memory) +4. VALIDATE → Confirm the claim you're citing actually appears in the paper +5. 
ADD → Add verified BibTeX to bibliography +If ANY step fails → mark as [CITATION NEEDED], inform scientist +``` + +```python +# Fetch BibTeX via DOI +import requests + +def doi_to_bibtex(doi: str) -> str: + response = requests.get( + f"https://doi.org/{doi}", + headers={"Accept": "application/x-bibtex"} + ) + response.raise_for_status() + return response.text +``` + +If you cannot verify a citation: + +```latex +\cite{PLACEHOLDER_author2024_verify_this} % TODO: Verify this citation exists +``` + +**Always tell the scientist**: "I've marked [X] citations as placeholders that need verification." + +See [references/citation-workflow.md](references/citation-workflow.md) for complete API documentation and the full `CitationManager` class. + +### Step 1.4: Organize Related Work + +Group papers by methodology, not paper-by-paper: + +**Good**: "One line of work uses X's assumption [refs] whereas we use Y's assumption because..." +**Bad**: "Smith et al. introduced X. Jones et al. introduced Y. We combine both." + +--- + +## Phase 2: Experiment Design + +**Goal**: Design experiments that directly support paper claims. Every experiment must answer a specific question. + +### Step 2.1: Map Claims to Experiments + +Create an explicit mapping: + +| Claim | Experiment | Expected Evidence | +|-------|-----------|-------------------| +| "Our method outperforms baselines" | Main comparison (Table 1) | Win rate, statistical significance | +| "Effect is larger for weaker models" | Model scaling study | Monotonic improvement curve | +| "Convergence requires scope constraints" | Constrained vs unconstrained | Convergence rate comparison | + +**Rule**: If an experiment doesn't map to a claim, don't run it. + +### Step 2.2: Design Baselines + +Strong baselines are what separates accepted papers from rejected ones. Reviewers will ask: "Did they compare against X?" 

Standard baseline categories:
- **Naive baseline**: Simplest possible approach
- **Strong baseline**: Best known existing method
- **Ablation baselines**: Your method minus one component
- **Compute-matched baselines**: Same compute budget, different allocation

### Step 2.3: Define Evaluation Protocol

Before running anything, specify:
- **Metrics**: What you're measuring, direction symbols (higher/lower better)
- **Aggregation**: How results are combined across runs/tasks
- **Statistical tests**: What tests will establish significance
- **Sample sizes**: How many runs/problems/tasks

### Step 2.4: Write Experiment Scripts

Follow these patterns from successful research pipelines:

**Incremental saving** — save results after each step for crash recovery:
```python
# Save after each problem/task
for task, strategy in work_items:  # iterate the experiment grid
    result_path = f"results/{task}/{strategy}/result.json"
    if os.path.exists(result_path):
        continue  # Skip already-completed work
    # ... run experiment ...
    with open(result_path, 'w') as f:
        json.dump(result, f, indent=2)
```

**Artifact preservation** — save all intermediate outputs (angle brackets are placeholders):
```
results/<experiment>/
  <task>/
    <run>/
      final_output.md   # Final result
      history.json      # Full trajectory
      pass_01/          # Per-iteration artifacts
        version_a.md
        version_b.md
        critic.md
```

**Separation of concerns** — keep generation, evaluation, and visualization separate:
```
run_experiment.py        # Core experiment runner
run_baselines.py         # Baseline comparison
run_comparison_judge.py  # Blind evaluation
analyze_results.py       # Statistical analysis
make_charts.py           # Visualization
```

See [references/experiment-patterns.md](references/experiment-patterns.md) for complete design patterns, cron monitoring, and error recovery.

---

## Phase 3: Experiment Execution & Monitoring

**Goal**: Run experiments reliably, monitor progress, recover from failures.

### Step 3.1: Launch Experiments

Use `nohup` for long-running experiments:

```bash
nohup python run_experiment.py --config config.yaml > logs/experiment_01.log 2>&1 &
echo $!  # Record the PID
```

**Parallel execution**: Run independent experiments simultaneously, but be aware of API rate limits. 4+ concurrent experiments on the same API will slow each down.

### Step 3.2: Set Up Monitoring (Cron Pattern)

For long-running experiments, set up periodic status checks. The cron prompt should follow this template (angle brackets are placeholders):

```
Monitor Prompt Template:
1. Check if process is still running: ps aux | grep <PID>
2. Read last 30 lines of log: tail -30 <logfile>
3. Check for completed results: ls <results-dir>
4. If results exist, read and report: cat <result-file>
5. If all done, commit: git add -A && git commit -m "<message>" && git push
6. Report in structured format (tables with key metrics)
7. Answer the key analytical question for this experiment
```

**Silent mode**: If nothing has changed since the last check, respond with `[SILENT]` to suppress notification to the user. Only report when there's news.

### Step 3.3: Handle Failures

Common failure modes and recovery:

| Failure | Detection | Recovery |
|---------|-----------|----------|
| API rate limit / credit exhaustion | 402/429 errors in logs | Wait, then re-run (scripts skip completed work) |
| Process crash | PID gone, incomplete results | Re-run from last checkpoint |
| Timeout on hard problems | Process stuck, no log progress | Kill and skip, note in results |
| Wrong model ID | Errors referencing model name | Fix ID and re-run |

**Key**: Scripts should always check for existing results and skip completed work. This makes re-runs safe and efficient.

### Step 3.4: Commit Completed Results

After each experiment batch completes:

```bash
git add -A
git commit -m "Add <experiment>: <key finding>"
git push
```

---

## Phase 4: Result Analysis

**Goal**: Extract findings, compute statistics, identify the story.

### Step 4.1: Aggregate Results

Write analysis scripts that:
1. Load all result files from a batch
2. Compute per-task and aggregate metrics
3. Generate summary tables

```python
# Standard analysis pattern
import json
import numpy as np
from pathlib import Path

results = {}
for result_file in Path("results/").rglob("result.json"):
    data = json.loads(result_file.read_text())
    strategy = result_file.parent.name
    task = result_file.parent.parent.name
    results.setdefault(strategy, {})[task] = data

# Compute aggregate metrics
for strategy, tasks in results.items():
    scores = [t["score"] for t in tasks.values()]
    print(f"{strategy}: mean={np.mean(scores):.1f}, std={np.std(scores):.1f}")
```

### Step 4.2: Statistical Significance

Always compute:
- **Error bars**: Standard deviation or standard error, specify which
- **Confidence intervals**: 95% CI for key results
- **Pairwise tests**: McNemar's test for comparing two methods
- **Effect sizes**: Cohen's d or h for practical significance

See [references/experiment-patterns.md](references/experiment-patterns.md) for complete implementations of McNemar's test, bootstrapped CIs, and Cohen's h.

### Step 4.3: Identify the Story

After analysis, explicitly answer:
1. **What is the main finding?** State it in one sentence.
2. **What surprised you?** Unexpected results often make the best papers.
3. **What failed?** Failed experiments can be the most informative. Honest reporting of failures strengthens the paper.
4. **What follow-up experiments are needed?** Results often raise new questions.
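The error bars, confidence intervals, and effect sizes recommended in Step 4.2 can be sketched with numpy alone; the score lists below are illustrative placeholders, not real results:

```python
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), lo, hi

def cohens_d(a, b):
    """Effect size between two independent samples (pooled std)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

# Illustrative per-run scores (NOT real results)
ours = [92.1, 91.4, 92.8, 91.9, 92.5]
base = [85.2, 86.0, 84.7, 85.5, 85.9]
mean, lo, hi = bootstrap_ci(ours)
print(f"ours: {mean:.1f} (95% CI [{lo:.1f}, {hi:.1f}]), "
      f"d vs. baseline = {cohens_d(ours, base):.2f}")
```

With only 5 runs the bootstrap CI is wide and should be reported as such; prefer the implementations in [references/experiment-patterns.md](references/experiment-patterns.md) for paired tests like McNemar's.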
+ +### Step 4.4: Create Figures and Tables + +**Figures**: +- Use vector graphics (PDF) for all plots: `plt.savefig('fig.pdf')` +- Colorblind-safe palettes (Okabe-Ito or Paul Tol) +- Self-contained captions — reader should understand without main text +- No title inside figure — the caption serves this function + +**Tables**: +- Use `booktabs` LaTeX package +- Bold best value per metric +- Include direction symbols (higher/lower better) +- Consistent decimal precision + +```latex +\usepackage{booktabs} +\begin{tabular}{lcc} +\toprule +Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\ +\midrule +Baseline & 85.2 & 45ms \\ +\textbf{Ours} & \textbf{92.1} & 38ms \\ +\bottomrule +\end{tabular} +``` + +### Step 4.5: Decide: More Experiments or Write? + +| Situation | Action | +|-----------|--------| +| Core claims supported, results significant | Move to Phase 5 (writing) | +| Results inconclusive, need more data | Back to Phase 2 (design) | +| Unexpected finding suggests new direction | Back to Phase 2 (design) | +| Missing one ablation reviewers will ask for | Run it, then Phase 5 | +| All experiments done but some failed | Note failures, move to Phase 5 | + +--- + +## Iterative Refinement: Strategy Selection + +Any output in this pipeline — paper drafts, experiment scripts, analysis — can be iteratively refined. The autoreason research provides empirical evidence for when each refinement strategy works and when it fails. Use this section to choose the right approach. + +### Quick Decision Table + +| Your Situation | Strategy | Why | +|---------------|----------|-----| +| Mid-tier model + constrained task | **Autoreason** | Sweet spot. Generation-evaluation gap is widest. Baselines actively destroy weak model outputs. | +| Mid-tier model + open task | **Autoreason** with scope constraints added | Add fixed facts, structure, or deliverable to bound the improvement space. 
| +| Frontier model + constrained task | **Autoreason** | Wins 2/3 constrained tasks even at frontier. | +| Frontier model + unconstrained task | **Critique-and-revise** or **single pass** | Autoreason comes last. Model self-evaluates well enough. | +| Concrete technical task (system design) | **Critique-and-revise** | Direct find-and-fix loop is more efficient. | +| Template-filling task (one correct structure) | **Single pass** or **conservative** | Minimal decision space. Iteration adds no value. | +| Code with test cases | **Autoreason (code variant)** | Structured analysis of *why* it failed before fixing. Recovery rate 62% vs 43%. | +| Very weak model (Llama 8B class) | **Single pass** | Model too weak for diverse candidates. Invest in generation quality. | + +### The Generation-Evaluation Gap + +**Core insight**: Autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability. + +``` +Model Tier │ Generation │ Self-Eval │ Gap │ Autoreason Value +──────────────────┼────────────┼───────────┼────────┼───────────────── +Weak (Llama 8B) │ Poor │ Poor │ Small │ None — can't generate diverse candidates +Mid (Haiku 3.5) │ Decent │ Poor │ LARGE │ MAXIMUM — 42/42 perfect Borda +Mid (Gemini Flash)│ Decent │ Moderate │ Large │ High — wins 2/3 +Strong (Sonnet 4) │ Good │ Decent │ Medium │ Moderate — wins 3/5 +Frontier (S4.6) │ Excellent │ Good │ Small │ Only with constraints +``` + +This gap is structural, not temporary. As costs drop, today's frontier becomes tomorrow's mid-tier. The sweet spot moves but never disappears. + +### Autoreason Loop (Summary) + +Each pass produces three candidates from fresh, isolated agents: + +1. **Critic** → finds problems in incumbent A (no fixes) +2. **Author B** → revises A based on critique +3. **Synthesizer** → merges A and B (randomized labels) +4. **Judge Panel** → 3 blind CoT judges rank A, B, AB via Borda count +5. 
**Convergence** → A wins k=2 consecutive passes → done + +**Key parameters:** +- k=2 convergence (k=1 premature, k=3 too expensive, no quality gain) +- CoT judges always (3x faster convergence) +- Temperature 0.8 authors, 0.3 judges +- Conservative tiebreak: incumbent wins ties +- Every role is a fresh agent with no shared context + +### Applying to Paper Drafts + +When refining the paper itself through autoreason: +- **Provide ground truth to the critic**: actual experimental data, result JSONs, statistical outputs. Without this, models hallucinate fabricated ablation studies and fake confidence intervals. +- **Use 3 working judges minimum**: A broken judge parser doesn't add noise — it prevents equilibrium entirely. +- **Scope constrain the revision**: "Address these specific weaknesses" not "improve the paper." + +### Failure Modes + +| Failure | Detection | Fix | +|---------|-----------|-----| +| No convergence (A never wins) | A wins <15% over 20+ passes | Add scope constraints to the task | +| Synthesis drift | Word counts grow unboundedly | Constrain structure and deliverable | +| Degradation below single pass | Baselines score higher than iterated output | Switch to single pass; model may be too weak | +| Overfitting (code) | High public-test pass, low private-test pass | Use structured analysis, not just test feedback | +| Broken judges | Parsing failures reduce panel below 3 | Fix parser before continuing | + +See [references/autoreason-methodology.md](references/autoreason-methodology.md) for complete prompts, Borda scoring details, model selection guide, scope constraint design patterns, and compute budget reference. + +--- + +## Phase 5: Paper Drafting + +**Goal**: Write a complete, publication-ready paper. + +### The Narrative Principle + +**The single most critical insight**: Your paper is not a collection of experiments — it's a story with one clear contribution supported by evidence. 
+ +Every successful ML paper centers on what Neel Nanda calls "the narrative": a short, rigorous, evidence-based technical story with a takeaway readers care about. + +**Three Pillars (must be crystal clear by end of introduction):** + +| Pillar | Description | Test | +|--------|-------------|------| +| **The What** | 1-3 specific novel claims | Can you state them in one sentence? | +| **The Why** | Rigorous empirical evidence | Do experiments distinguish your hypothesis from alternatives? | +| **The So What** | Why readers should care | Does this connect to a recognized community problem? | + +**If you cannot state your contribution in one sentence, you don't yet have a paper.** + +### Time Allocation + +Spend approximately **equal time** on each of: +1. The abstract +2. The introduction +3. The figures +4. Everything else combined + +**Why?** Most reviewers form judgments before reaching your methods. Readers encounter your paper as: title → abstract → introduction → figures → maybe the rest. + +### Writing Workflow + +``` +Paper Writing Checklist: +- [ ] Step 1: Define the one-sentence contribution +- [ ] Step 2: Draft Figure 1 (core idea or most compelling result) +- [ ] Step 3: Draft abstract (5-sentence formula) +- [ ] Step 4: Draft introduction (1-1.5 pages max) +- [ ] Step 5: Draft methods +- [ ] Step 6: Draft experiments & results +- [ ] Step 7: Draft related work +- [ ] Step 8: Draft conclusion & discussion +- [ ] Step 9: Draft limitations (REQUIRED by all venues) +- [ ] Step 10: Plan appendix (proofs, extra experiments, details) +- [ ] Step 11: Complete paper checklist +- [ ] Step 12: Final review +``` + +### Step 5.0: Title + +The title is the single most-read element of the paper. It determines whether anyone clicks through to the abstract. 

**Good titles**:
- State the contribution or finding: "Autoreason: When Iterative LLM Refinement Works and Why It Fails"
- Highlight a surprising result: "Scaling Data-Constrained Language Models" (implies you can)
- Name the method + what it does: "Direct Preference Optimization: Your Language Model is Secretly a Reward Model"

**Bad titles**:
- Too generic: "An Approach to Improving Language Model Outputs"
- Too long: anything over ~15 words
- Jargon-only: "Asymptotic Convergence of Iterative Stochastic Policy Refinement" (who is this for?)

**Rules**:
- Include your method name if you have one (for citability)
- Include 1-2 keywords reviewers will search for
- Avoid colons unless both halves carry meaning
- Test: would a reviewer know the domain and contribution from the title alone?

### Step 5.1: Abstract (5-Sentence Formula)

From Sebastian Farquhar (DeepMind):

```
1. What you achieved: "We introduce...", "We prove...", "We demonstrate..."
2. Why this is hard and important
3. How you do it (with specialist keywords for discoverability)
4. What evidence you have
5. Your most remarkable number/result
```

**Delete** generic openings like "Large language models have achieved remarkable success..."

### Step 5.2: Figure 1

Figure 1 is the second thing most readers look at (after abstract). Draft it before writing the introduction — it forces you to clarify the core idea.

| Figure 1 Type | When to Use | Example |
|---------------|-------------|---------|
| **Method diagram** | New architecture or pipeline | TikZ flowchart showing your system |
| **Results teaser** | One compelling result tells the whole story | Bar chart: "Ours vs baselines" with clear gap |
| **Problem illustration** | The problem is unintuitive | Before/after showing failure mode you fix |
| **Conceptual diagram** | Abstract contribution needs visual grounding | 2x2 matrix of method properties |

**Rules**: Figure 1 must be understandable without reading any text.
The caption alone should communicate the core idea. Use color purposefully — don't just decorate.

### Step 5.3: Introduction (1-1.5 pages max)

Must include:
- Clear problem statement
- Brief approach overview
- 2-4 bullet contribution list (max 1-2 lines each in two-column format)
- Methods should start by page 2-3

### Step 5.4: Methods

Enable reimplementation:
- Conceptual outline or pseudocode
- All hyperparameters listed
- Architectural details sufficient for reproduction
- Present final design decisions; ablations go in experiments

### Step 5.5: Experiments & Results

For each experiment, explicitly state:
- **What claim it supports**
- How it connects to main contribution
- What to observe: "the blue line shows X, which demonstrates Y"

Requirements:
- Error bars with methodology (std dev vs std error)
- Hyperparameter search ranges
- Compute infrastructure (GPU type, total hours)
- Seed-setting methods

### Step 5.6: Related Work

Organize methodologically, not paper-by-paper. Cite generously — reviewers likely authored relevant papers.

### Step 5.7: Limitations (REQUIRED)

All major conferences require this. Honesty helps:
- Reviewers are instructed not to penalize honest limitation acknowledgment
- Pre-empt criticisms by identifying weaknesses first
- Explain why limitations don't undermine core claims

### Step 5.8: Conclusion & Discussion

**Conclusion** (required, 0.5-1 page):
- Restate the contribution in one sentence (different wording from abstract)
- Summarize key findings (2-3 sentences, not a list)
- Implications: what does this mean for the field?
- Future work: 2-3 concrete next steps (not vague "we leave X for future work")

**Discussion** (optional, sometimes combined with conclusion):
- Broader implications beyond immediate results
- Connections to other subfields
- Honest assessment of when the method does and doesn't work
- Practical deployment considerations

**Do NOT** introduce new results or claims in the conclusion.

### Step 5.9: Appendix Strategy

Appendices are unlimited at all major venues and are essential for reproducibility. Structure:

| Appendix Section | What Goes Here |
|-----------------|---------------|
| **Proofs & Derivations** | Full proofs too long for main text. Main text can state theorems with "proof in Appendix A." |
| **Additional Experiments** | Ablations, scaling curves, per-dataset breakdowns, hyperparameter sensitivity |
| **Implementation Details** | Full hyperparameter tables, training details, hardware specs, random seeds |
| **Dataset Documentation** | Data collection process, annotation guidelines, licensing, preprocessing |
| **Prompts & Templates** | Exact prompts used (for LLM-based methods), evaluation templates |
| **Human Evaluation** | Annotation interface screenshots, instructions given to annotators, IRB details |
| **Additional Figures** | Per-task breakdowns, trajectory visualizations, failure case examples |

**Rules**:
- The main paper must be self-contained — reviewers are not required to read appendices
- Never put critical evidence only in the appendix
- Cross-reference: "Full results in Table 5 (Appendix B)" not just "see appendix"
- Use the `\appendix` command; sections after it are lettered automatically (`\section{Proofs}` becomes Appendix A)
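A minimal sketch of that appendix scaffolding (section names are illustrative); once `\appendix` is in effect, sectioning switches to letters automatically in the standard conference classes:

```latex
% After the main text and bibliography:
\appendix

\section{Proofs}                  % numbered as Appendix A
\label{app:proofs}
% Full proof of the theorem stated in the main text.

\section{Implementation Details}  % numbered as Appendix B
\label{app:impl}
% Hyperparameter tables, hardware specs, random seeds.
```

In the main text, cross-reference with `Appendix~\ref{app:proofs}` so pointers survive renumbering.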
+ +### Page Budget Management + +When over the page limit: + +| Cut Strategy | Saves | Risk | +|-------------|-------|------| +| Move proofs to appendix | 0.5-2 pages | Low — standard practice | +| Condense related work | 0.5-1 page | Medium — may miss key citations | +| Combine tables with subfigures | 0.25-0.5 page | Low — often improves readability | +| Use `\vspace{-Xpt}` sparingly | 0.1-0.3 page | Low if subtle, high if obvious | +| Remove qualitative examples | 0.5-1 page | Medium — reviewers like examples | +| Reduce figure sizes | 0.25-0.5 page | High — figures must remain readable | + +**Do NOT**: reduce font size, change margins, remove required sections (limitations, broader impact), or use `\small`/`\footnotesize` for main text. + +### Writing Style + +**Sentence-level clarity (Gopen & Swan's 7 Principles):** + +| Principle | Rule | +|-----------|------| +| Subject-verb proximity | Keep subject and verb close | +| Stress position | Place emphasis at sentence ends | +| Topic position | Put context first, new info after | +| Old before new | Familiar info → unfamiliar info | +| One unit, one function | Each paragraph makes one point | +| Action in verb | Use verbs, not nominalizations | +| Context before new | Set stage before presenting | + +**Word choice (Lipton, Steinhardt):** +- Be specific: "accuracy" not "performance" +- Eliminate hedging: drop "may" unless genuinely uncertain +- Consistent terminology throughout +- Avoid incremental vocabulary: "develop", not "combine" + +**Full writing guide with examples**: See [references/writing-guide.md](references/writing-guide.md) + +### Using LaTeX Templates + +**Always copy the entire template directory first, then write within it.** + +``` +Template Setup Checklist: +- [ ] Step 1: Copy entire template directory to new project +- [ ] Step 2: Verify template compiles as-is (before any changes) +- [ ] Step 3: Read the template's example content to understand structure +- [ ] Step 4: Replace example content 
section by section
+- [ ] Step 5: Use template macros (check preamble for \newcommand definitions)
+- [ ] Step 6: Clean up template artifacts only at the end
+```
+
+**Step 1: Copy the Full Template**
+
+```bash
+cp -r templates/neurips2025/ ~/papers/my-paper/
+cd ~/papers/my-paper/
+ls -la # Should see: main.tex, neurips.sty, Makefile, etc.
+```
+
+Copy the ENTIRE directory, not just the .tex file. Templates include style files (.sty), bibliography styles (.bst), example content, and Makefiles.
+
+**Step 2: Verify Template Compiles First**
+
+Before making ANY changes:
+```bash
+latexmk -pdf main.tex
+# Or manual: pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
+```
+
+If the unmodified template doesn't compile, fix that first (usually missing TeX packages — install via `tlmgr install <package>`).
+
+**Step 3: Keep Template Content as Reference**
+
+Don't immediately delete example content. Comment it out and use as formatting reference:
+```latex
+% Template example (keep for reference):
+% \begin{figure}[t]
+%   \centering
+%   \includegraphics[width=0.8\linewidth]{example-image}
+%   \caption{Template shows caption style}
+% \end{figure}
+
+% Your actual figure:
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=0.8\linewidth]{your-figure.pdf}
+  \caption{Your caption following the same style.}
+\end{figure}
+```
+
+**Step 4: Replace Content Section by Section**
+
+Work through systematically: title/authors → abstract → introduction → methods → experiments → related work → conclusion → references → appendix. Compile after each section.
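+
+For larger papers, a multi-file layout makes section-by-section replacement (and later diffing with `latexdiff --flatten`) easier. A sketch, with illustrative file names:
+
+```latex
+% main.tex — the preamble comes from the conference template
+\begin{document}
+\maketitle
+\input{sections/01_introduction}
+\input{sections/02_method}
+\input{sections/03_experiments}
+\input{sections/04_related_work}
+\input{sections/05_conclusion}
+\end{document}
+```
+
+Each section then lives in its own file, so drafts, reviews, and delegated rewrites can target one file at a time.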
+
+**Step 5: Use Template Macros**
+
+```latex
+\usepackage{xspace} % Required for the \xspace calls below
+
+\newcommand{\method}{YourMethodName} % Consistent method naming
+\newcommand{\eg}{e.g.,\xspace} % Proper abbreviations
+\newcommand{\ie}{i.e.,\xspace}
+```
+
+### Template Pitfalls
+
+| Pitfall | Problem | Solution |
+|---------|---------|----------|
+| Copying only `.tex` file | Missing `.sty`, won't compile | Copy entire directory |
+| Modifying `.sty` files | Breaks conference formatting | Never edit style files |
+| Adding random packages | Conflicts, breaks template | Only add if necessary |
+| Deleting template content early | Lose formatting reference | Keep as comments until done |
+| Not compiling frequently | Errors accumulate | Compile after each section |
+| Raster PNGs for figures | Blurry in paper | Always use vector PDF via `savefig('fig.pdf')` |
+
+### Quick Template Reference
+
+| Conference | Main File | Style File | Page Limit |
+|------------|-----------|------------|------------|
+| NeurIPS 2025 | `main.tex` | `neurips.sty` | 9 pages |
+| ICML 2026 | `example_paper.tex` | `icml2026.sty` | 8 pages |
+| ICLR 2026 | `iclr2026_conference.tex` | `iclr2026_conference.sty` | 9 pages |
+| ACL 2025 | `acl_latex.tex` | `acl.sty` | 8 pages (long) |
+| AAAI 2026 | `aaai2026-unified-template.tex` | `aaai2026.sty` | 7 pages |
+| COLM 2025 | `colm2025_conference.tex` | `colm2025_conference.sty` | 9 pages |
+
+**Universal**: Double-blind, references don't count, appendices unlimited, LaTeX required.
+
+Templates in `templates/` directory. See [templates/README.md](templates/README.md) for compilation setup (VS Code, CLI, Overleaf, other IDEs).
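+
+A quick usage sketch for the Step 5 template macros in body text (illustrative):
+
+```latex
+We introduce \method{}, which refines drafts iteratively.
+Several design choices (\eg the judge panel) are load-bearing;
+\ie removing them degrades output quality.
+```
+
+Defining the method name once means a late rename is a one-line change.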
+ +### Tables and Figures + +**Tables** — use `booktabs` for professional formatting: + +```latex +\usepackage{booktabs} +\begin{tabular}{lcc} +\toprule +Method & Accuracy $\uparrow$ & Latency $\downarrow$ \\ +\midrule +Baseline & 85.2 & 45ms \\ +\textbf{Ours} & \textbf{92.1} & 38ms \\ +\bottomrule +\end{tabular} +``` + +Rules: +- Bold best value per metric +- Include direction symbols ($\uparrow$ higher better, $\downarrow$ lower better) +- Right-align numerical columns +- Consistent decimal precision + +**Figures**: +- **Vector graphics** (PDF, EPS) for all plots and diagrams — `plt.savefig('fig.pdf')` +- **Raster** (PNG 600 DPI) only for photographs +- **Colorblind-safe palettes** (Okabe-Ito or Paul Tol) +- Verify **grayscale readability** (8% of men have color vision deficiency) +- **No title inside figure** — the caption serves this function +- **Self-contained captions** — reader should understand without main text + +### Conference Resubmission + +For converting between venues, see Phase 7 (Submission Preparation) — it covers the full conversion workflow, page-change table, and post-rejection guidance. + +### Professional LaTeX Preamble + +Add these packages to any paper for professional quality. They are compatible with all major conference style files: + +```latex +% --- Professional Packages (add after conference style file) --- + +% Typography +\usepackage{microtype} % Microtypographic improvements (protrusion, expansion) + % Makes text noticeably more polished — always include + +% Tables +\usepackage{booktabs} % Professional table rules (\toprule, \midrule, \bottomrule) +\usepackage{siunitx} % Consistent number formatting, decimal alignment + % Usage: \num{12345} → 12,345; \SI{3.5}{GHz} → 3.5 GHz + % Table alignment: S column type for decimal-aligned numbers + +% Figures +\usepackage{graphicx} % Include graphics (\includegraphics) +\usepackage{subcaption} % Subfigures with (a), (b), (c) labels + % Usage: \begin{subfigure}{0.48\textwidth} ... 
\end{subfigure} + +% Diagrams and Algorithms +\usepackage{tikz} % Programmable vector diagrams +\usetikzlibrary{arrows.meta, positioning, shapes.geometric, calc, fit, backgrounds} +\usepackage[ruled,vlined]{algorithm2e} % Professional pseudocode + % Alternative: \usepackage{algorithmicx} if template bundles it + +% Cross-references +\usepackage{cleveref} % Smart references: \cref{fig:x} → "Figure 1" + % MUST be loaded AFTER hyperref + % Handles: figures, tables, sections, equations, algorithms + +% Math (usually included by conference .sty, but verify) +\usepackage{amsmath,amssymb} % AMS math environments and symbols +\usepackage{mathtools} % Extends amsmath (dcases, coloneqq, etc.) + +% Colors (for figures and diagrams) +\usepackage{xcolor} % Color management +% Okabe-Ito colorblind-safe palette: +\definecolor{okblue}{HTML}{0072B2} +\definecolor{okorange}{HTML}{E69F00} +\definecolor{okgreen}{HTML}{009E73} +\definecolor{okred}{HTML}{D55E00} +\definecolor{okpurple}{HTML}{CC79A7} +\definecolor{okcyan}{HTML}{56B4E9} +\definecolor{okyellow}{HTML}{F0E442} +``` + +**Notes:** +- `microtype` is the single highest-impact package for visual quality. It adjusts character spacing at a sub-pixel level. Always include it. +- `siunitx` handles decimal alignment in tables via the `S` column type — eliminates manual spacing. +- `cleveref` must be loaded **after** `hyperref`. Most conference .sty files load hyperref, so put cleveref last. +- Check if the conference template already loads any of these (especially `algorithm`, `amsmath`, `graphicx`). Don't double-load. 
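+
+A usage sketch for `cleveref` (label names are illustrative; expansions shown follow the "Figure 1" style noted above):
+
+```latex
+\cref{fig:architecture} shows the pipeline.     % -> "Figure 2 shows the pipeline."
+\cref{tab:main,tab:ablation} report results.    % multiple labels are combined and sorted
+\Cref{sec:method} describes training.           % capitalized variant for sentence starts
+```
+
+Because `cleveref` inserts the reference type itself, renumbering a float or changing a table into a figure never leaves a stale "Table"/"Figure" word in the text.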
+
+### siunitx Table Alignment
+
+`siunitx` makes number-heavy tables significantly more readable:
+
+```latex
+\begin{tabular}{l S[table-format=2.1] S[table-format=2.1] S[table-format=2.1]}
+\toprule
+Method & {Accuracy $\uparrow$} & {F1 $\uparrow$} & {Latency (ms) $\downarrow$} \\
+\midrule
+Baseline & 85.2 & 83.7 & 45.3 \\
+Ablation (no X) & 87.1 & 85.4 & 42.1 \\
+\textbf{Ours} & \textbf{92.1} & \textbf{90.8} & \textbf{38.7} \\
+\bottomrule
+\end{tabular}
+```
+
+The `S` column type auto-aligns on the decimal point. Headers in `{}` escape the alignment. Note that bold values need care in `S` columns: either brace the cell (`{\textbf{92.1}}`, which exempts it from alignment) or enable font detection in your `siunitx` setup so `\textbf` parses correctly.
+
+### Subfigures
+
+Standard pattern for side-by-side figures:
+
+```latex
+\begin{figure}[t]
+  \centering
+  \begin{subfigure}[b]{0.48\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{fig_results_a.pdf}
+    \caption{Results on Dataset A.}
+    \label{fig:results-a}
+  \end{subfigure}
+  \hfill
+  \begin{subfigure}[b]{0.48\textwidth}
+    \centering
+    \includegraphics[width=\textwidth]{fig_results_b.pdf}
+    \caption{Results on Dataset B.}
+    \label{fig:results-b}
+  \end{subfigure}
+  \caption{Comparison of our method across two datasets. (a) shows the scaling
+  behavior and (b) shows the ablation results. Both use 5 random seeds.}
+  \label{fig:results}
+\end{figure}
+```
+
+Use `\cref{fig:results}` → "Figure 1", `\cref{fig:results-a}` → "Figure 1a".
+ +### Pseudocode with algorithm2e + +```latex +\begin{algorithm}[t] +\caption{Iterative Refinement with Judge Panel} +\label{alg:method} +\KwIn{Task $T$, model $M$, judges $J_1 \ldots J_n$, convergence threshold $k$} +\KwOut{Final output $A^*$} +$A \gets M(T)$ \tcp*{Initial generation} +$\text{streak} \gets 0$\; +\While{$\text{streak} < k$}{ + $C \gets \text{Critic}(A, T)$ \tcp*{Identify weaknesses} + $B \gets M(T, C)$ \tcp*{Revised version addressing critique} + $AB \gets \text{Synthesize}(A, B)$ \tcp*{Merge best elements} + \ForEach{judge $J_i$}{ + $\text{rank}_i \gets J_i(\text{shuffle}(A, B, AB))$ \tcp*{Blind ranking} + } + $\text{winner} \gets \text{BordaCount}(\text{ranks})$\; + \eIf{$\text{winner} = A$}{ + $\text{streak} \gets \text{streak} + 1$\; + }{ + $A \gets \text{winner}$; $\text{streak} \gets 0$\; + } +} +\Return{$A$}\; +\end{algorithm} +``` + +### TikZ Diagram Patterns + +TikZ is the standard for method diagrams in ML papers. Common patterns: + +**Pipeline/Flow Diagram** (most common in ML papers): + +```latex +\begin{figure}[t] +\centering +\begin{tikzpicture}[ + node distance=1.8cm, + box/.style={rectangle, draw, rounded corners, minimum height=1cm, + minimum width=2cm, align=center, font=\small}, + arrow/.style={-{Stealth[length=3mm]}, thick}, +] + \node[box, fill=okcyan!20] (input) {Input\\$x$}; + \node[box, fill=okblue!20, right of=input] (encoder) {Encoder\\$f_\theta$}; + \node[box, fill=okgreen!20, right of=encoder] (latent) {Latent\\$z$}; + \node[box, fill=okorange!20, right of=latent] (decoder) {Decoder\\$g_\phi$}; + \node[box, fill=okred!20, right of=decoder] (output) {Output\\$\hat{x}$}; + + \draw[arrow] (input) -- (encoder); + \draw[arrow] (encoder) -- (latent); + \draw[arrow] (latent) -- (decoder); + \draw[arrow] (decoder) -- (output); +\end{tikzpicture} +\caption{Architecture overview. 
The encoder maps input $x$ to latent +representation $z$, which the decoder reconstructs.} +\label{fig:architecture} +\end{figure} +``` + +**Comparison/Matrix Diagram** (for showing method variants): + +```latex +\begin{tikzpicture}[ + cell/.style={rectangle, draw, minimum width=2.5cm, minimum height=1cm, + align=center, font=\small}, + header/.style={cell, fill=gray!20, font=\small\bfseries}, +] + % Headers + \node[header] at (0, 0) {Method}; + \node[header] at (3, 0) {Converges?}; + \node[header] at (6, 0) {Quality?}; + % Rows + \node[cell] at (0, -1) {Single Pass}; + \node[cell, fill=okgreen!15] at (3, -1) {N/A}; + \node[cell, fill=okorange!15] at (6, -1) {Baseline}; + \node[cell] at (0, -2) {Critique+Revise}; + \node[cell, fill=okred!15] at (3, -2) {No}; + \node[cell, fill=okred!15] at (6, -2) {Degrades}; + \node[cell] at (0, -3) {Ours}; + \node[cell, fill=okgreen!15] at (3, -3) {Yes ($k$=2)}; + \node[cell, fill=okgreen!15] at (6, -3) {Improves}; +\end{tikzpicture} +``` + +**Iterative Loop Diagram** (for methods with feedback): + +```latex +\begin{tikzpicture}[ + node distance=2cm, + box/.style={rectangle, draw, rounded corners, minimum height=0.8cm, + minimum width=1.8cm, align=center, font=\small}, + arrow/.style={-{Stealth[length=3mm]}, thick}, + label/.style={font=\scriptsize, midway, above}, +] + \node[box, fill=okblue!20] (gen) {Generator}; + \node[box, fill=okred!20, right=2.5cm of gen] (critic) {Critic}; + \node[box, fill=okgreen!20, below=1.5cm of $(gen)!0.5!(critic)$] (judge) {Judge Panel}; + + \draw[arrow] (gen) -- node[label] {output $A$} (critic); + \draw[arrow] (critic) -- node[label, right] {critique $C$} (judge); + \draw[arrow] (judge) -| node[label, left, pos=0.3] {winner} (gen); +\end{tikzpicture} +``` + +### latexdiff for Revision Tracking + +Essential for rebuttals — generates a marked-up PDF showing changes between versions: + +```bash +# Install +# macOS: brew install latexdiff (or comes with TeX Live) +# Linux: sudo apt install latexdiff 
+
+# Generate diff
+latexdiff paper_v1.tex paper_v2.tex > paper_diff.tex
+pdflatex paper_diff.tex
+
+# For multi-file projects (with \input{} or \include{})
+latexdiff --flatten paper_v1.tex paper_v2.tex > paper_diff.tex
+```
+
+This produces a PDF with deletions in red strikethrough and additions in blue — standard format for rebuttal supplements.
+
+### SciencePlots for matplotlib
+
+Install and use for publication-quality plots:
+
+```bash
+pip install SciencePlots
+```
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+import scienceplots # registers styles
+
+# Example data: accuracy over training steps
+x = np.arange(1000)
+y = 1 - np.exp(-x / 300)  # Ours
+y2 = 1 - np.exp(-x / 500) # Baseline
+
+# Use science style (IEEE-like, clean)
+with plt.style.context(['science', 'no-latex']):
+    fig, ax = plt.subplots(figsize=(3.5, 2.5)) # Single-column width
+    ax.plot(x, y, label='Ours', color='#0072B2')
+    ax.plot(x, y2, label='Baseline', color='#D55E00', linestyle='--')
+    ax.set_xlabel('Training Steps')
+    ax.set_ylabel('Accuracy')
+    ax.legend()
+    fig.savefig('paper/fig_results.pdf', bbox_inches='tight')
+
+# Available styles: 'science', 'ieee', 'nature', 'science+ieee'
+# Add 'no-latex' if LaTeX is not installed on the machine generating plots
+```
+
+**Standard figure sizes** (two-column format):
+- Single column: `figsize=(3.5, 2.5)` — fits in one column
+- Double column: `figsize=(7.0, 3.0)` — spans both columns
+- Square: `figsize=(3.5, 3.5)` — for heatmaps, confusion matrices
+
+---
+
+## Phase 6: Self-Review & Revision
+
+**Goal**: Simulate the review process before submission. Catch weaknesses early.
+
+### Step 6.1: Simulate Reviews
+
+Generate reviews from multiple perspectives using strong models (Opus 4, Sonnet 4.6, Gemini 2.5 Pro). Use the reviewer guidelines from the target venue.
+
+**Review prompt template:**
+
+```
+You are an expert reviewer for [VENUE]. Review this paper according to the
+official reviewer guidelines. Evaluate:
+
+1. Quality (technical soundness, baselines, claims supported by evidence)
+2. Clarity (writing, notation consistency, reproducibility)
+3.
Significance (impact, importance of the problem) +4. Originality (novelty, new insights) + +Provide: +- Summary (2-3 sentences) +- Strengths (bullet list) +- Weaknesses (bullet list, most critical first) +- Questions for authors +- Missing references +- Score (1-6 on NeurIPS scale) +- Confidence (1-5) +``` + +### Step 6.2: Prioritize Feedback + +After collecting reviews, categorize: + +| Priority | Action | +|----------|--------| +| **Critical** (technical flaw, missing baseline) | Must fix. May require new experiments → back to Phase 2 | +| **High** (clarity issue, missing ablation) | Should fix in this revision | +| **Medium** (minor writing issues, extra experiments) | Fix if time allows | +| **Low** (style preferences, tangential suggestions) | Note for future work | + +### Step 6.3: Revision Cycle + +For each critical/high issue: +1. Identify the specific section(s) affected +2. Draft the fix +3. Verify the fix doesn't break other claims +4. Update the paper +5. Re-check against the reviewer's concern + +### Step 6.4: Rebuttal Writing + +When responding to actual reviews (post-submission), rebuttals are a distinct skill from revision: + +**Format**: Point-by-point. For each reviewer concern: +``` +> R1-W1: "The paper lacks comparison with Method X." + +We thank the reviewer for this suggestion. We have added a comparison with +Method X in Table 3 (revised). Our method outperforms X by 3.2pp on [metric] +(p<0.05). We note that X requires 2x our compute budget. 
+```
+
+**Rules**:
+- Address every concern — reviewers notice if you skip one
+- Lead with the strongest responses
+- Be concise and direct — reviewers read dozens of rebuttals
+- Include new results if you ran experiments during the rebuttal period
+- Never be defensive or dismissive, even of weak criticisms
+- Use `latexdiff` to generate a marked-up PDF showing changes (see "latexdiff for Revision Tracking" above)
+- Thank reviewers for specific, actionable feedback (not generic praise)
+
+**What NOT to do**: "We respectfully disagree" without evidence. "This is out of scope" without explanation. Ignoring a weakness by only responding to strengths.
+
+### Step 6.5: Paper Evolution Tracking
+
+Save snapshots at key milestones:
+```
+paper/
+  paper.tex                     # Current working version
+  paper_v1_first_draft.tex      # First complete draft
+  paper_v2_post_review.tex      # After simulated review
+  paper_v3_pre_submission.tex   # Final before submission
+  paper_v4_camera_ready.tex     # Post-acceptance final
+```
+
+---
+
+## Phase 7: Submission Preparation
+
+**Goal**: Final checks, formatting, and submission.
+
+### Step 7.1: Conference Checklist
+
+Every venue has mandatory checklists. Complete them carefully — incomplete checklists can result in desk rejection.
+
+See [references/checklists.md](references/checklists.md) for:
+- NeurIPS 16-item paper checklist
+- ICML broader impact + reproducibility
+- ICLR LLM disclosure policy
+- ACL mandatory limitations section
+- Universal pre-submission checklist
+
+### Step 7.2: Anonymization Checklist
+
+Double-blind review means reviewers cannot know who wrote the paper. Check ALL of these:
+
+```
+Anonymization Checklist:
+- [ ] No author names or affiliations anywhere in the PDF
+- [ ] No acknowledgments section (add after acceptance)
+- [ ] Self-citations written in third person: "Smith et al. [1] showed..." not "We previously showed [1]..."
+- [ ] No GitHub/GitLab URLs pointing to your personal repos +- [ ] Use Anonymous GitHub (https://anonymous.4open.science/) for code links +- [ ] No institutional logos or identifiers in figures +- [ ] No file metadata containing author names (check PDF properties) +- [ ] No "our previous work" or "in our earlier paper" phrasing +- [ ] Dataset names don't reveal institution (rename if needed) +- [ ] Supplementary materials don't contain identifying information +``` + +**Common mistakes**: Git commit messages visible in supplementary code, watermarked figures from institutional tools, acknowledgments left in from a previous draft, arXiv preprint posted before anonymity period. + +### Step 7.3: Formatting Verification + +``` +Pre-Submission Format Check: +- [ ] Page limit respected (excluding references and appendix) +- [ ] All figures are vector (PDF) or high-res raster (600 DPI PNG) +- [ ] All figures readable in grayscale +- [ ] All tables use booktabs +- [ ] References compile correctly (no "?" in citations) +- [ ] No overfull hboxes in critical areas +- [ ] Appendix clearly labeled and separated +- [ ] Required sections present (limitations, broader impact, etc.) 
+```
+
+### Step 7.4: Final Compilation
+
+```bash
+# Clean build: latexmk -C removes generated files (safer than rm -f *.pdf,
+# which can also delete figure PDFs sitting in the same directory)
+latexmk -C
+latexmk -pdf main.tex
+
+# Or manual
+pdflatex main.tex
+bibtex main
+pdflatex main.tex
+pdflatex main.tex
+```
+
+### Step 7.5: Conference-Specific Requirements
+
+| Venue | Special Requirements |
+|-------|---------------------|
+| **NeurIPS** | Paper checklist in appendix, lay summary if accepted |
+| **ICML** | Broader Impact Statement (after conclusion, doesn't count toward limit) |
+| **ICLR** | LLM disclosure required, reciprocal reviewing agreement |
+| **ACL** | Mandatory Limitations section, Responsible NLP checklist |
+| **AAAI** | Strict style file — no modifications whatsoever |
+| **COLM** | Frame contribution for language model community |
+
+### Step 7.6: Conference Resubmission & Format Conversion
+
+When converting between venues, **never copy LaTeX preambles between templates**:
+
+```bash
+# 1. Start fresh with target template
+cp -r templates/icml2026/ new_submission/
+
+# 2. Copy ONLY content sections (not preamble)
+# - Abstract text, section content, figures, tables, bib entries
+
+# 3. Adjust for page limits
+# 4. Add venue-specific required sections
+# 5. Update references
+```
+
+| From → To | Page Change | Key Adjustments |
+|-----------|-------------|-----------------|
+| NeurIPS → ICML | 9 → 8 | Cut 1 page, add Broader Impact |
+| ICML → ICLR | 8 → 9 | Expand experiments, add LLM disclosure |
+| NeurIPS → ACL | 9 → 8 | Restructure for NLP conventions, add Limitations |
+| ICLR → AAAI | 9 → 7 | Significant cuts, strict style adherence |
+| Any → COLM | varies → 9 | Reframe for language model focus |
+
+When cutting pages: move proofs to appendix, condense related work, combine tables, use subfigures.
+When expanding: add ablations, expand limitations, include additional baselines, add qualitative examples.
+ +**After rejection**: Address reviewer concerns in the new version, but don't include a "changes" section or reference the previous submission (blind review). + +### Step 7.7: Camera-Ready Preparation (Post-Acceptance) + +After acceptance, prepare the camera-ready version: + +``` +Camera-Ready Checklist: +- [ ] De-anonymize: add author names, affiliations, email addresses +- [ ] Add Acknowledgments section (funding, compute grants, helpful reviewers) +- [ ] Add public code/data URL (real GitHub, not anonymous) +- [ ] Address any mandatory revisions from meta-reviewer +- [ ] Switch template to camera-ready mode (if applicable — e.g., AAAI \anon → \camera) +- [ ] Add copyright notice if required by venue +- [ ] Update any "anonymous" placeholders in text +- [ ] Verify final PDF compiles cleanly +- [ ] Check page limit for camera-ready (sometimes differs from submission) +- [ ] Upload supplementary materials (code, data, appendix) to venue portal +``` + +--- + +## Hermes Agent Integration + +This skill is designed for the Hermes agent. It uses Hermes tools, delegation, scheduling, and memory for the full research lifecycle. + +### Related Skills + +Compose this skill with other Hermes skills for specific phases: + +| Skill | When to Use | How to Load | +|-------|-------------|-------------| +| **arxiv** | Phase 1 (Literature Review): searching arXiv, generating BibTeX, finding related papers via Semantic Scholar | `skill_view("arxiv")` | +| **subagent-driven-development** | Phase 5 (Drafting): parallel section writing with 2-stage review (spec compliance then quality) | `skill_view("subagent-driven-development")` | +| **plan** | Phase 0 (Setup): creating structured plans before execution. 
Writes to `.hermes/plans/` | `skill_view("plan")` | +| **qmd** | Phase 1 (Literature): searching local knowledge bases (notes, transcripts, docs) via hybrid BM25+vector search | Install: `skill_manage("install", "qmd")` | +| **diagramming** | Phase 4-5: creating Excalidraw-based figures and architecture diagrams | `skill_view("diagramming")` | +| **data-science** | Phase 4 (Analysis): Jupyter live kernel for interactive analysis and visualization | `skill_view("data-science")` | + +**This skill supersedes `ml-paper-writing`** — it contains all of ml-paper-writing's content plus the full experiment/analysis pipeline and autoreason methodology. + +### Hermes Tools Reference + +| Tool | Usage in This Pipeline | +|------|----------------------| +| **`terminal`** | LaTeX compilation (`latexmk -pdf`), git operations, launching experiments (`nohup python run.py &`), process checks | +| **`process`** | Background experiment management: `process("start", ...)`, `process("poll", pid)`, `process("log", pid)`, `process("kill", pid)` | +| **`execute_code`** | Run Python for citation verification, statistical analysis, data aggregation. Has tool access via RPC. | +| **`read_file`** / **`write_file`** / **`patch`** | Paper editing, experiment scripts, result files. Use `patch` for targeted edits to large .tex files. | +| **`web_search`** | Literature discovery: `web_search("transformer attention mechanism 2024")` | +| **`web_extract`** | Fetch paper content, verify citations: `web_extract("https://arxiv.org/abs/2303.17651")` | +| **`delegate_task`** | **Parallel section drafting** — spawn isolated subagents for each section. Also for concurrent citation verification. | +| **`todo`** | Primary state tracker across sessions. Update after every phase transition. | +| **`memory`** | Persist key decisions across sessions: contribution framing, venue choice, reviewer feedback. | +| **`cronjob`** | Schedule experiment monitoring, deadline countdowns, automated arXiv checks. 
|
+| **`clarify`** | Ask the user targeted questions when blocked (venue choice, contribution framing). |
+| **`send_message`** | Notify user when experiments complete or drafts are ready, even if user isn't in chat. |
+
+### Tool Usage Patterns
+
+**Experiment monitoring** (most common):
+```
+terminal("ps aux | grep <experiment>")
+→ terminal("tail -30 <logfile>")
+→ terminal("ls results/")
+→ execute_code("analyze results JSON, compute metrics")
+→ terminal("git add -A && git commit -m '<message>' && git push")
+→ send_message("Experiment complete: <summary>")
+```
+
+**Parallel section drafting** (using delegation):
+```
+delegate_task("Draft the Methods section based on these experiment scripts and configs.
+    Include: pseudocode, all hyperparameters, architectural details sufficient for
+    reproduction. Write in LaTeX using the neurips2025 template conventions.")
+
+delegate_task("Draft the Related Work section. Use web_search and web_extract to
+    find papers. Verify every citation via Semantic Scholar. Group by methodology.")
+
+delegate_task("Draft the Experiments section. Read all result files in results/.
+    State which claim each experiment supports. Include error bars and significance.")
+```
+
+Each delegate runs as a **fresh subagent** with no shared context — provide all necessary information in the prompt. Collect outputs and integrate.
+
+**Citation verification** (using execute_code):
+```python
+# In execute_code:
+from semanticscholar import SemanticScholar
+import requests
+
+sch = SemanticScholar()
+results = sch.search_paper("attention mechanism transformers", limit=5)
+for paper in results:
+    doi = (paper.externalIds or {}).get('DOI') # externalIds can be missing
+    if doi:
+        bibtex = requests.get(f"https://doi.org/{doi}",
+                              headers={"Accept": "application/x-bibtex"}).text
+        print(bibtex)
+```
+
+### State Management with `memory` and `todo`
+
+**`memory` tool** — persist key decisions (bounded: ~2200 chars for MEMORY.md):
+
+```
+memory("add", "Paper: autoreason. Venue: NeurIPS 2025 (9 pages).
+ Contribution: structured refinement works when generation-evaluation gap is wide. + Key results: Haiku 42/42, Sonnet 3/5, S4.6 constrained 2/3. + Status: Phase 5 — drafting Methods section.") +``` + +Update memory after major decisions or phase transitions. This persists across sessions. + +**`todo` tool** — track granular progress: + +``` +todo("add", "Design constrained task experiments for Sonnet 4.6") +todo("add", "Run Haiku baseline comparison") +todo("add", "Draft Methods section") +todo("update", id=3, status="in_progress") +todo("update", id=1, status="completed") +``` + +**Session startup protocol:** +``` +1. todo("list") # Check current task list +2. memory("read") # Recall key decisions +3. terminal("git log --oneline -10") # Check recent commits +4. terminal("ps aux | grep python") # Check running experiments +5. terminal("ls results/ | tail -20") # Check for new results +6. Report status to user, ask for direction +``` + +### Cron Monitoring with `cronjob` + +Use the `cronjob` tool to schedule periodic experiment checks: + +``` +cronjob("create", { + "schedule": "*/30 * * * *", # Every 30 minutes + "prompt": "Check experiment status: + 1. ps aux | grep run_experiment + 2. tail -30 logs/experiment_haiku.log + 3. ls results/haiku_baselines/ + 4. If complete: read results, compute Borda scores, + git add -A && git commit -m 'Add Haiku results' && git push + 5. Report: table of results, key finding, next step + 6. If nothing changed: respond with [SILENT]" +}) +``` + +**[SILENT] protocol**: When nothing has changed since the last check, respond with exactly `[SILENT]`. This suppresses notification delivery to the user. Only report when there are genuine changes worth knowing about. + +**Deadline tracking**: +``` +cronjob("create", { + "schedule": "0 9 * * *", # Daily at 9am + "prompt": "NeurIPS 2025 deadline: May 22. Today is {date}. + Days remaining: {compute}. + Check todo list — are we on track? + If <7 days: warn user about remaining tasks." 
+}) +``` + +### Communication Patterns + +**When to notify the user** (via `send_message` or direct response): +- Experiment batch completed (with results table) +- Unexpected finding or failure requiring decision +- Draft section ready for review +- Deadline approaching with incomplete tasks + +**When NOT to notify:** +- Experiment still running, no new results → `[SILENT]` +- Routine monitoring with no changes → `[SILENT]` +- Intermediate steps that don't need attention + +**Report format** — always include structured data: +``` +## Experiment: +Status: Complete / Running / Failed + +| Task | Method A | Method B | Method C | +|------|---------|---------|---------| +| Task 1 | 85.2 | 82.1 | **89.4** | + +Key finding: +Next step: +``` + +### Decision Points Requiring Human Input + +Use `clarify` for targeted questions when genuinely blocked: + +| Decision | When to Ask | +|----------|-------------| +| Target venue | Before starting paper (affects page limits, framing) | +| Contribution framing | When multiple valid framings exist | +| Experiment priority | When TODO list has more experiments than time allows | +| Submission readiness | Before final submission | + +**Do NOT ask about** (be proactive, make a choice, flag it): +- Word choice, section ordering +- Which specific results to highlight +- Citation completeness (draft with what you find, note gaps) + +--- + +## Reviewer Evaluation Criteria + +Understanding what reviewers look for helps focus effort: + +| Criterion | What They Check | +|-----------|----------------| +| **Quality** | Technical soundness, well-supported claims, fair baselines | +| **Clarity** | Clear writing, reproducible by experts, consistent notation | +| **Significance** | Community impact, advances understanding | +| **Originality** | New insights (doesn't require new method) | + +**Scoring (NeurIPS 6-point scale):** +- 6: Strong Accept — groundbreaking, flawless +- 5: Accept — technically solid, high impact +- 4: Borderline Accept — 
solid, limited evaluation +- 3: Borderline Reject — weaknesses outweigh +- 2: Reject — technical flaws +- 1: Strong Reject — known results or ethics issues + +See [references/reviewer-guidelines.md](references/reviewer-guidelines.md) for detailed guidelines, common concerns, and rebuttal strategies. + +--- + +## Common Issues and Solutions + +| Issue | Solution | +|-------|----------| +| Abstract too generic | Delete first sentence if it could prepend any ML paper. Start with your specific contribution. | +| Introduction exceeds 1.5 pages | Split background into Related Work. Front-load contribution bullets. | +| Experiments lack explicit claims | Add: "This experiment tests whether [specific claim]..." before each one. | +| Reviewers find paper hard to follow | Add signposting, use consistent terminology, make figure captions self-contained. | +| Missing statistical significance | Add error bars, number of runs, statistical tests, confidence intervals. | +| Scope creep in experiments | Every experiment must map to a specific claim. Cut experiments that don't. | +| Paper rejected, need to resubmit | See Conference Resubmission in Phase 7. Address reviewer concerns without referencing reviews. 
| + +--- + +## Reference Documents + +| Document | Contents | +|----------|----------| +| [references/writing-guide.md](references/writing-guide.md) | Gopen & Swan 7 principles, Perez micro-tips, Lipton word choice, Steinhardt precision, figure design | +| [references/citation-workflow.md](references/citation-workflow.md) | Citation APIs, Python code, CitationManager class, BibTeX management | +| [references/checklists.md](references/checklists.md) | NeurIPS 16-item, ICML, ICLR, ACL requirements, universal pre-submission checklist | +| [references/reviewer-guidelines.md](references/reviewer-guidelines.md) | Evaluation criteria, scoring, common concerns, rebuttal template | +| [references/sources.md](references/sources.md) | Complete bibliography of all writing guides, conference guidelines, APIs | +| [references/experiment-patterns.md](references/experiment-patterns.md) | Experiment design patterns, evaluation protocols, monitoring, error recovery | +| [references/autoreason-methodology.md](references/autoreason-methodology.md) | Autoreason loop, strategy selection, model guide, prompts, scope constraints, Borda scoring | + +### LaTeX Templates + +Templates in `templates/` for: **NeurIPS 2025**, **ICML 2026**, **ICLR 2026**, **ACL**, **AAAI 2026**, **COLM 2025**. + +See [templates/README.md](templates/README.md) for compilation instructions. 
+ +### Key External Sources + +**Writing Philosophy:** +- [Neel Nanda: How to Write ML Papers](https://www.alignmentforum.org/posts/eJGptPbbFPZGLpjsp/highly-opinionated-advice-on-how-to-write-ml-papers) +- [Sebastian Farquhar: How to Write ML Papers](https://sebastianfarquhar.com/on-research/2024/11/04/how_to_write_ml_papers/) +- [Gopen & Swan: Science of Scientific Writing](https://cseweb.ucsd.edu/~swanson/papers/science-of-writing.pdf) +- [Lipton: Heuristics for Scientific Writing](https://www.approximatelycorrect.com/2018/01/29/heuristics-technical-scientific-writing-machine-learning-perspective/) +- [Perez: Easy Paper Writing Tips](https://ethanperez.net/easy-paper-writing-tips/) + +**APIs:** [Semantic Scholar](https://api.semanticscholar.org/api-docs/) | [CrossRef](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) | [arXiv](https://info.arxiv.org/help/api/basics.html) + +**Venues:** [NeurIPS](https://neurips.cc/Conferences/2025/PaperInformation/StyleFiles) | [ICML](https://icml.cc/Conferences/2025/AuthorInstructions) | [ICLR](https://iclr.cc/Conferences/2026/AuthorGuide) | [ACL](https://github.com/acl-org/acl-style-files) diff --git a/skills/research/research-paper-writing/references/autoreason-methodology.md b/skills/research/research-paper-writing/references/autoreason-methodology.md new file mode 100644 index 000000000..a77fe14a6 --- /dev/null +++ b/skills/research/research-paper-writing/references/autoreason-methodology.md @@ -0,0 +1,394 @@ +# Autoreason: Iterative Refinement Methodology + +Complete reference for the autoreason iterative refinement method, derived from experimental results across subjective writing tasks, competitive programming, and four model tiers. Use this when any output (paper draft, experiment script, analysis, task definition) needs iterative improvement. 
+ +**Source**: [NousResearch/autoreason](https://github.com/NousResearch/autoreason) — "Autoreason: When Iterative LLM Refinement Works and Why It Fails" + +--- + +## Strategy Selection Guide + +### Decision Tree + +``` +Is the task objectively verifiable (code, math, factual)? +├── YES → Does the model solve it on the first attempt? +│ ├── YES → Use single pass (no refinement needed) +│ └── NO → Use autoreason (structured analysis → reason-informed revision) +│ +└── NO (subjective) → What model tier are you using? + ├── Weak (Llama 8B, small models) + │ → Single pass. Model too weak for refinement to help. + │ Invest in generation quality, not iteration. + │ + ├── Mid-tier (Haiku 3.5, Gemini Flash) + │ → Autoreason with stronger judges. This is the sweet spot. + │ Self-refinement DESTROYS weak model outputs — autoreason prevents this. + │ + ├── Strong (Sonnet 4) + │ → Autoreason for open-ended tasks. Wins 3/5. + │ Critique-and-revise for concrete technical tasks (2/5). + │ + └── Frontier (Sonnet 4.6, Opus) + ├── Constrained scope? → Autoreason. Wins 2/3 constrained tasks. + └── Unconstrained? → Critique-and-revise or single pass. + Autoreason FAILS on unconstrained frontier tasks (comes last). 
+``` + +### Strategy Comparison Table + +| Strategy | Best For | Avoid When | Compute (per iteration) | +|----------|----------|------------|------------------------| +| **Single pass** | Frontier models, template tasks, tight budgets | Mid-tier models where quality ceiling is low | 1 call | +| **Critique-and-revise** | Concrete technical requirements (system design, specifications) | Weak models (degrades output), unconstrained subjective tasks | 2 calls | +| **Autoreason** | Mid-tier models, constrained scope, tasks with genuine tradeoffs | Weak models (Llama 8B), frontier + unconstrained | ~6 calls | +| **Best-of-N** | Almost never recommended | Weak models especially — worse than single pass | N calls | + +### Why Each Strategy Fails + +| Strategy | Failure Mode | Mechanism | +|----------|-------------|-----------| +| **Single pass** | Quality ceiling | No mechanism to improve beyond first attempt | +| **Critique-and-revise** | Progressive degradation | Model hallucinates problems (sycophancy), scope creeps each pass, never declines to change | +| **Best-of-N** | Random selection | Without good ranking signal, more samples = more mediocre options | +| **Autoreason (unconstrained)** | Synthesis drift | Stronger models produce syntheses so consistently preferred that incumbent never stabilizes | + +--- + +## The Autoreason Loop + +### Architecture + +``` +┌──────────────────────────────────────────────────────────┐ +│ ITERATION LOOP │ +│ │ +│ Incumbent A ──► Critic ──► Author B ──► Synthesizer │ +│ │ │ │ +│ │ ┌───────────────────────┘ │ +│ ▼ ▼ │ +│ [A] [AB] [B] │ +│ │ │ │ │ +│ └──────────────┼────────────┘ │ +│ ▼ │ +│ Judge Panel (blind) │ +│ │ │ +│ ▼ │ +│ Winner │ +│ │ │ +│ ┌───────┴───────┐ │ +│ ▼ ▼ │ +│ A wins k=2 B or AB wins │ +│ consecutive? 
→ new incumbent │ +│ │ │ +│ ▼ │ +│ CONVERGED │ +└──────────────────────────────────────────────────────────┘ +``` + +### Roles + +Every role is a **fresh, isolated agent** with no shared context: + +| Role | Input | Output | Key Rule | +|------|-------|--------|----------| +| **Critic** | Task + Incumbent A | List of problems | Find problems ONLY. No fixes. No suggestions. | +| **Author B** | Task + A + Critique | Revised version B | Address each criticism. State which problem each change fixes. | +| **Synthesizer** | Task + X + Y (randomized labels) | Synthesis AB | Take strongest elements of each. Not a compromise. | +| **Judge Panel** | Task + A, AB, B (randomized labels + order) | Ranking | Rank best to worst. No authorship stake. | + +### Configuration + +| Parameter | Value | Rationale | +|-----------|-------|-----------| +| **Convergence k** | 2 | k=1 premature (94% displaced later). k=2 converges 100%, quality plateaus. k=3 fails 24%, 2x cost, no quality gain. | +| **Author temperature** | 0.7-0.8 | Encourages diverse revisions | +| **Judge temperature** | 0.3 | Encourages consistent evaluation | +| **In-loop judges** | 3 | Balance per-pass cost vs evaluation stability | +| **Final evaluation judges** | 7 | Higher statistical power for final comparison | +| **Max tokens** | 4096 | Standard; 8192 for long-form (papers) | +| **Judge type** | Chain-of-thought | 3x faster convergence on some tasks. Always use. | +| **Tiebreak** | Conservative (incumbent wins) | Prevents false positives — A must be genuinely beaten | +| **Max passes** | 25 (constrained), 50 (remedy) | Safety cap; most converge by pass 10-15 | + +### Prompts + +#### Critic +``` +System: You are a critical reviewer. Your only job is to find real problems. +Be specific and concrete. Do not suggest fixes. + +User: Find real problems with this proposal. 
Focus on: +- Things that won't work as described +- Complexity that doesn't pay for itself +- Assumptions that are wrong +- Missing pieces +Do NOT propose fixes. Just the problems. +``` + +#### Author B +``` +System: You are a senior consultant revising a proposal based on specific +criticisms. Address each valid criticism directly. Do not make changes not +motivated by an identified problem. + +User: [TASK] + [VERSION A] + [CRITIC OUTPUT] +Revise to address these problems. For each change, state which problem it fixes. +``` + +#### Synthesizer +``` +System: You are given two versions as equal inputs. Take the strongest elements +from each and produce a coherent synthesis. This is not a compromise. + +User: [TASK] + [VERSION X] + [VERSION Y] +(labels randomized — synthesizer doesn't know which is incumbent) +``` + +#### Judge (Chain-of-Thought) — ALWAYS USE THIS VERSION +``` +System: You are an independent evaluator. Think carefully before deciding. + +User: [TASK] + Three proposals. For each, think step by step: +1. What does it get right? +2. What does it get wrong or miss? +3. Are numbers and claims defensible? +4. Is detail appropriate or bloated? +After reasoning, rank all three. +RANKING: [best], [second], [worst] +``` + +#### Baseline Prompts (for comparison experiments) + +| Baseline | Prompt | +|----------|--------| +| **Conservative** | "Make minimal improvements while preserving what works. Do not add new sections or significantly expand scope." | +| **Improve this** | "Improve this document." (no further guidance) | +| **Harsh critic** | "Critically evaluate and rewrite, fixing all weaknesses you identify." | +| **Critique & revise** | Step 1: "Produce a structured critique. List specific weaknesses." Step 2: "Revise to address each criticism." | + +--- + +## Scoring: Borda Count + +Judges rank candidates. 
Points awarded by rank position: + +| Rank | Points (3 candidates) | +|------|----------------------| +| 1st | 3 | +| 2nd | 2 | +| 3rd | 1 | + +**Aggregation**: Sum across all judges. Winner = highest total. +**Tiebreak**: Incumbent (A) wins any tie. + +**Example** (3 judges): +- Judge 1: AB > A > B → AB gets 3, A gets 2, B gets 1 +- Judge 2: A > AB > B → A gets 3, AB gets 2, B gets 1 +- Judge 3: AB > B > A → AB gets 3, B gets 2, A gets 1 +- Totals: AB=8, A=6, B=4 → AB wins, becomes new incumbent + +**Randomization per judge**: +- Candidate labels randomized (A might be called "Proposal X" for one judge, "Proposal Z" for another) +- Presentation order randomized (AB might appear first or last) +- This prevents position bias and label bias + +--- + +## Model Selection Guide + +### Empirical Results by Model Tier + +| Model | Autoreason Wins | Autoreason Avg Borda | Best Baseline | Margin | Recommendation | +|-------|----------------|---------------------|---------------|--------|----------------| +| **Llama 3.1 8B** | 1/3 | 23.7 | 25.0 (single) | -1.3 | Skip autoreason. Model too weak for diverse candidates. | +| **Gemini 2.0 Flash** | 2/3 | 25.0 | 20.0 (single) | +5.0 | Good candidate. Moderate gains. | +| **Haiku 3.5** | 3/3 | **42.0** | 33.7 (single) | **+8.3** | **Best candidate.** Perfect scores. Baselines actively destroy quality. | +| **Sonnet 4** | 3/5 | 27.8 | 22.4 (C&R) | +5.4 | Good candidate for open tasks. C&R better for technical tasks. | +| **Sonnet 4.6 (unconstrained)** | 0/1 | 7.0 | 31.0 (C&R) | -24.0 | Do NOT use autoreason without constraints. | +| **Sonnet 4.6 (constrained)** | 2/3 | 29.0 | 27.0 (improve) | +2.0 | Use only with scope constraints. 
| + +### The Generation-Evaluation Gap + +The core insight: **autoreason's value depends on the gap between a model's generation capability and its self-evaluation capability.** + +``` +Weak models (Llama 8B): + Generation: Poor | Self-evaluation: Poor + Gap: Small (both bad) → Autoreason can't help, no diverse candidates + +Mid-tier models (Haiku, Flash): + Generation: Decent | Self-evaluation: Poor + Gap: LARGE → Autoreason's sweet spot. External eval bridges the gap. + +Strong models (Sonnet 4): + Generation: Good | Self-evaluation: Decent + Gap: Moderate → Autoreason helps on 3/5 tasks + +Frontier models (Sonnet 4.6): + Generation: Excellent | Self-evaluation: Good + Gap: Small → Simple methods suffice. Autoreason hurts on unconstrained tasks. +``` + +**Practical rule**: As model costs drop and capabilities improve, today's frontier becomes tomorrow's mid-tier. The generation-evaluation gap is structural, not temporary. Match refinement architecture to the model's position on the capability curve. + +### Judge Selection + +| Author Model | Recommended Judge | Rationale | +|-------------|------------------|-----------| +| Llama 8B | Don't use autoreason | Model too weak | +| Gemini Flash | Sonnet 4 | Cross-model evaluation works | +| Haiku 3.5 | Sonnet 4 | Strong external eval is the mechanism | +| Haiku 3.5 | Haiku 3.5 (same) | Still works — tournament structure provides value even without strong judges (20.7 vs 18.3 avg Borda) | +| Sonnet 4 | Sonnet 4 (same) | Same-model judges work at this tier | +| Sonnet 4.6 | Sonnet 4.6 (same) | Only with scope constraints | + +--- + +## Scope Constraint Design + +### What Makes Autoreason Work on Constrained Tasks + +The same model (Sonnet 4.6) goes from **last place** (unconstrained) to **first place** (constrained) with scope constraints. The constraints bound the improvement space so synthesis drift can't accumulate. 
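
The loop mechanics from the preceding sections (Borda aggregation, conservative tiebreak, k=2 convergence) can be sketched in a few lines of Python. This is a minimal sketch, not the reference implementation: `generate_candidates` and `collect_judge_rankings` are hypothetical stand-ins for the critic/author/synthesizer calls and the blind judge panel.

```python
def borda_scores(judge_rankings, n_candidates=3):
    """Aggregate blind-judge rankings: n points for 1st place, down to 1 for last."""
    scores = {}
    for ranking in judge_rankings:  # each ranking lists candidate ids, best first
        for position, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + (n_candidates - position)
    return scores

def pick_winner(scores, incumbent="A"):
    """Conservative tiebreak: the incumbent keeps its seat unless strictly beaten."""
    best = max(scores.values())
    if scores.get(incumbent, 0) == best:
        return incumbent
    return max(scores, key=scores.get)

def refine(task, incumbent, k=2, max_passes=25):
    """Iterate until the incumbent survives k consecutive judge panels."""
    streak = 0
    for _ in range(max_passes):
        candidates = generate_candidates(task, incumbent)    # {"A": ..., "B": ..., "AB": ...}
        rankings = collect_judge_rankings(task, candidates)  # e.g. 3 blind CoT judges
        winner = pick_winner(borda_scores(rankings))
        if winner == "A":
            streak += 1
            if streak >= k:
                return incumbent  # converged
        else:
            incumbent, streak = candidates[winner], 0
    return incumbent  # safety cap reached
```

On the worked example above, `borda_scores([["AB", "A", "B"], ["A", "AB", "B"], ["AB", "B", "A"]])` yields `{"AB": 8, "A": 6, "B": 4}`, so AB displaces the incumbent and the streak resets.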
+ +### Effective Constraints + +| Constraint Type | Example | Why It Works | +|----------------|---------|-------------| +| **Fixed facts** | "Use only these 8 data points, add nothing else" | Bounds information space | +| **Fixed deliverable** | "500-word startup pitch" (not "improve this") | Defines done condition | +| **Fixed structure** | "Exactly 4 sections, each with 3 numbered items" | Prevents structural drift | +| **Fixed change items** | "Address exactly these 3 reviewer concerns" | Bounds modification scope | + +### Ineffective Constraints + +| Constraint | Why It Fails | What Happens | +|-----------|-------------|-------------| +| Word count alone | Not a scope constraint | False convergence — rejected for length, not quality | +| "Be concise" | Too vague | Ignored after 2-3 passes | +| "Be comprehensive" | Anti-constraint | Invites scope creep | +| No constraints at all | Unbounded improvement space | Synthesis dominates, no convergence | + +### Task Categories + +| Task Type | Autoreason Works? 
| Why | +|-----------|-------------------|-----| +| Tasks with genuine tradeoffs (strategy, policy) | Yes | Multiple valid approaches for tournament to select between | +| Constrained writing (pitch, memo, postmortem) | Mostly (2/3) | Bounded scope, clear evaluation criteria | +| Template-filling (incident postmortem) | No | One correct structure, minimal decision space | +| Competitive programming | Yes | Naturally scoped, test suite provides external verification | +| Open-ended unconstrained + frontier model | No | Synthesis drift, no convergence | + +--- + +## Failure Taxonomy + +| Failure Mode | Condition | Detection | Evidence | +|-------------|-----------|-----------|----------| +| **Self-correction unreliable** | No external evaluation signal | Baselines degrade below single pass | Haiku baselines: 16.3 avg vs 33.7 single pass | +| **Drift / synthesis dominance** | Unconstrained scope | A wins <15%, AB dominates | Sonnet 4.6 unconstrained: A wins 12%, AB wins 60%+ | +| **Overfitting to visible feedback** | Shallow revision loop (C&R) | High public/private divergence | C&R overfits 32% on hard code problems | +| **No convergence** | Broken judge pipeline | Parsing failures, <3 valid judges | Mixed panel parser failure: 11+ passes | +| **Model too weak** | Insufficient generation diversity | All candidates look similar | Llama 8B wins only 1/3 tasks | + +### Recovery Patterns + +| Failure | Recovery | +|---------|----------| +| No convergence (drift) | Add scope constraints to the task | +| No convergence (broken judges) | Fix parser, ensure 3 valid judges before continuing | +| Quality degrades with iteration | Switch to single pass or add constraints | +| Model too weak | Use a stronger model for generation, keep weak model for cheap roles | +| Overfitting (code) | Use structured analysis step, not just test feedback | + +--- + +## Code Domain Adaptation + +The autoreason method adapts differently for code vs writing: + +### Writing Domain +``` +Call 1: 
Critic (find problems in incumbent) +Call 2: Author B (revise based on critique) +Call 3: Synthesizer (merge A and B) +Calls 4-6: Judge Panel (3 blind judges rank A, B, AB) +``` + +### Code Domain (6-call budget) +``` +Call 1: Initial generation +Call 2: Structured analysis (5 points — NO CODE): + - Problem analysis: what does the problem actually require? + - Approach analysis: what approach did we use, is it correct? + - Failure analysis: why did tests fail? + - Alternative approaches: what else could work? + - Edge cases: what inputs might break the solution? +Calls 3-6: Reason-informed revisions + - Each revision must explain WHY it fixes the issue + - Sees test results from public (visible) test cases +``` + +**Key difference**: The code strategy replaces the judge panel with test-suite evaluation (objective ground truth). The structured analysis step (Call 2) is what drives recovery — it forces reasoning about *why* the approach failed before attempting fixes. + +**Results**: Recovery is the mechanism. Among problems where both autoreason and single-pass failed initially, autoreason recovered 62% vs single-pass's 43% (McNemar p=0.041, Cohen's h=0.32). + +--- + +## Applying Autoreason to Paper Writing + +The paper itself was refined using autoreason (Section 8 of the paper): + +### Setup +- Model: claude-opus-4 +- Judges: 3 Opus judges +- Enhancement: Ground-truth critic (access to actual experimental data) +- Result: Converged in 9 passes + +### Key Findings for Paper Refinement + +1. **Ground-truth critic is essential**: Without ground-truth access, Opus hallucinated a fabricated ablation study, fake confidence intervals, wrong model names, and incorrect role descriptions. With ground-truth access, the critic caught all four on pass 1. + +2. **Judge panel integrity matters**: A broken parser in one judge (Gemini output format mismatch) reduced the panel from 3 to 2 judges. This prevented convergence for 11+ passes. 
Fixing to 3 working judges, the same incumbent converged in 2 passes. A broken judge doesn't add noise — it prevents equilibrium. + +### Recommended Setup for Paper Refinement + +``` +Critic prompt: "You are reviewing a research paper draft. You have access to the +actual experimental results [GROUND TRUTH DATA]. Find factual errors, unsupported +claims, hallucinated results, and structural problems. Do not suggest fixes." + +Author B prompt: "Revise this paper draft to fix the identified problems. For each +change, cite the specific problem it addresses. Do not add claims not supported by +the provided experimental data." + +Judge prompt (CoT): "Compare three versions of this paper. For each, evaluate: +1. Factual accuracy against the provided results +2. Clarity of the narrative and contribution +3. Whether claims are properly hedged and supported +4. Writing quality (concision, precision, no filler) +After reasoning, rank all three. RANKING: [best], [second], [worst]" +``` + +### What to Provide as Ground Truth +- All experimental result JSON files +- Statistical test outputs +- Raw numbers for every table and figure +- Configuration files showing exact hyperparameters +- Code that generated the results (for method description accuracy) + +--- + +## Compute Budget Reference + +| Method | Calls per Pass | Typical Passes | Total Calls | Relative Cost | +|--------|---------------|----------------|-------------|---------------| +| Single pass | 1 | 1 | 1 | 1x | +| Best-of-N | N | 1 | N | Nx | +| Critique & revise | 2 | 15 | 30 | 30x | +| Autoreason (in-loop) | ~6 | 10-15 | 60-90 | 60-90x | +| Autoreason (with final eval) | ~6 + 7 | 10-15 + 1 | 67-97 | ~80x | + +**Cost-quality tradeoff**: Autoreason uses ~6x more compute per pass and typically runs more passes. This is a real tradeoff. The method trades compute for evaluation quality. On constrained tasks with mid-tier models, this tradeoff is strongly positive. 
On unconstrained tasks with frontier models, it's negative. + +**CoT judges reduce cost**: 1 CoT judge provides evaluation quality comparable to 3 standard judges, at ~40% cost savings. Always use CoT judges. diff --git a/skills/research/ml-paper-writing/references/checklists.md b/skills/research/research-paper-writing/references/checklists.md similarity index 79% rename from skills/research/ml-paper-writing/references/checklists.md rename to skills/research/research-paper-writing/references/checklists.md index 1c46b75cc..7c65bb955 100644 --- a/skills/research/ml-paper-writing/references/checklists.md +++ b/skills/research/research-paper-writing/references/checklists.md @@ -10,6 +10,8 @@ This reference documents the mandatory checklist requirements for major ML/AI co - [ICML Paper Checklist](#icml-paper-checklist) - [ICLR Requirements](#iclr-requirements) - [ACL Requirements](#acl-requirements) +- [AAAI Requirements](#aaai-requirements) +- [COLM Requirements](#colm-requirements) - [Universal Pre-Submission Checklist](#universal-pre-submission-checklist) --- @@ -280,6 +282,77 @@ If applicable: --- +## AAAI Requirements + +### Formatting (Strictest of All Venues) + +AAAI enforces formatting rules more strictly than any other major venue. Papers that deviate from the template are desk-rejected. 
+ +- [ ] Use the **exact** AAAI style file without modification — no `\setlength`, no `\vspace` hacks, no font overrides +- [ ] 7 pages main content (8 for camera-ready with author info) +- [ ] Two-column format, Times font (set by template) +- [ ] References and appendices do not count toward page limit +- [ ] Abstract must be a single paragraph +- [ ] Do not modify margins, column widths, or font sizes + +### Required Sections + +- [ ] Abstract (single paragraph, no math or citations) +- [ ] Introduction with clear contribution statement +- [ ] References in AAAI format (uses `aaai2026.bst`) +- [ ] Appendix (optional, unlimited) + +### Ethics and Reproducibility + +- [ ] Broader impact statement (encouraged but not always mandatory — check current year's CFP) +- [ ] Reproducibility details (datasets, code availability) +- [ ] Acknowledge use of AI writing tools if applicable + +### Key Differences from Other Venues + +- **No separate limitations section required** (unlike ACL), but discussing limitations is recommended +- **Strictest formatting enforcement** — the style checker will reject non-compliant PDFs +- **No paper checklist** like NeurIPS has, but the universal checklist below still applies +- **Unified template** covers main paper and supplementary in the same file + +--- + +## COLM Requirements + +### Overview + +COLM (Conference on Language Modeling) focuses specifically on language model research. Framing must target this community. 
+ +### Formatting + +- [ ] 9 pages main content (10 for camera-ready) +- [ ] Use COLM template (based on ICLR template with modifications) +- [ ] Double-blind review +- [ ] References and appendices unlimited + +### Required Sections + +- [ ] Abstract +- [ ] Introduction framed for language modeling community +- [ ] Conclusion +- [ ] References + +### Content Expectations + +- [ ] Contribution must be relevant to language models (broadly interpreted: training, evaluation, applications, theory, alignment, safety) +- [ ] If the method is general, frame with language model examples +- [ ] Baselines should include recent LM-specific methods where applicable + +### Key Differences from Other Venues + +- **Narrower scope** than NeurIPS/ICML — must frame for LM community +- **Template derived from ICLR** — similar formatting rules +- **Newer venue** — reviewer norms are still establishing; err on the side of thorough evaluation +- **No mandatory checklist** like NeurIPS, but broader impact discussion is expected +- **LLM disclosure**: If LLMs were used in research (code generation, data annotation, writing assistance), disclose this + +--- + ## Universal Pre-Submission Checklist ### Before Every Submission diff --git a/skills/research/ml-paper-writing/references/citation-workflow.md b/skills/research/research-paper-writing/references/citation-workflow.md similarity index 97% rename from skills/research/ml-paper-writing/references/citation-workflow.md rename to skills/research/research-paper-writing/references/citation-workflow.md index b2b33bd6f..3d188b52f 100644 --- a/skills/research/ml-paper-writing/references/citation-workflow.md +++ b/skills/research/research-paper-writing/references/citation-workflow.md @@ -289,7 +289,7 @@ class CitationManager: ) if resp.status_code == 200: sources.append("CrossRef") - except: + except Exception: pass # Check arXiv if ID available @@ -301,7 +301,7 @@ class CitationManager: ) if "" in resp.text and "" in resp.text: 
sources.append("arXiv") - except: + except Exception: pass return len(sources) >= 2, sources @@ -318,7 +318,7 @@ class CitationManager: ) if resp.status_code == 200: return resp.text - except: + except Exception: pass # Fallback: generate from paper data @@ -419,7 +419,7 @@ def batch_cite(queries: List[str], output_file: str = "references.bib"): | Customization | Limited | Highly flexible | | Backend | bibtex | Biber (recommended) | -**Recommendation**: Use BibLaTeX with Biber for new papers. +**Recommendation**: Use natbib with BibTeX for conference submissions — all major venue templates (NeurIPS, ICML, ICLR, ACL, AAAI, COLM) ship with natbib and `.bst` files. BibLaTeX with Biber is an option for journals or personal projects where you control the template. ### LaTeX Setup diff --git a/skills/research/research-paper-writing/references/experiment-patterns.md b/skills/research/research-paper-writing/references/experiment-patterns.md new file mode 100644 index 000000000..f9fb243fe --- /dev/null +++ b/skills/research/research-paper-writing/references/experiment-patterns.md @@ -0,0 +1,728 @@ +# Experiment Design Patterns + +Patterns and best practices distilled from running research experiments at scale with the Hermes agent. These cover experiment infrastructure, evaluation protocols, monitoring, and failure recovery. 
+ +--- + +## Experiment Infrastructure + +### Directory Structure + +Organize experiments with a consistent structure: + +``` +workspace/ + experiments/ + run_main.py # Core experiment runner + run_baselines.py # Baseline comparison + run_ablation.py # Ablation studies + strategies.py # Method implementations + config.yaml # Shared configuration + results/ + <experiment_name>/ + <task_or_problem>/ + <strategy>/ + result.json # Final metrics + final_output.md # Final output artifact + history.json # Full trajectory/log + pass_01/ # Per-iteration artifacts (if iterative) + intermediate.md + analysis/ + analyze_results.py # Statistical analysis + compute_stats.py # Significance tests + make_charts.py # Visualization + paper/ + paper.tex # LaTeX source + fig_*.pdf # Generated figures +``` + +### Script Design Principles + +**1. Incremental Saving (Crash Recovery)** + +Every experiment script should save results after each unit of work, and skip already-completed work on restart: + +```python +import json, os +from pathlib import Path + +def run_experiment(problems, strategies, output_dir): + for problem in problems: + for strategy in strategies: + result_path = Path(output_dir) / problem["id"] / strategy / "result.json" + if result_path.exists(): + print(f"Skipping {problem['id']}/{strategy} (already done)") + continue + + # Run the experiment + result = execute_strategy(problem, strategy) + + # Save immediately + result_path.parent.mkdir(parents=True, exist_ok=True) + with open(result_path, 'w') as f: + json.dump(result, f, indent=2) +``` + +This pattern makes re-runs safe and efficient. If a process crashes at problem 47/150, restarting skips the first 46. + +**2. Artifact Preservation** + +Save all intermediate outputs, not just final results. 
This enables post-hoc analysis without re-running: + +```python +def save_pass_artifacts(output_dir, pass_num, artifacts): + """Save all artifacts from a single pass of an iterative method.""" + pass_dir = Path(output_dir) / f"pass_{pass_num:02d}" + pass_dir.mkdir(parents=True, exist_ok=True) + + for name, content in artifacts.items(): + with open(pass_dir / f"{name}.md", 'w') as f: + f.write(content) +``` + +**3. Configuration Management** + +Use YAML configs for reproducibility: + +```yaml +# config.yaml +model: anthropic/claude-sonnet-4-20250514 +author_temperature: 0.8 +judge_temperature: 0.3 +max_tokens: 4096 +num_judges: 3 +max_passes: 15 +convergence_k: 2 +``` + +```python +import yaml + +with open("config.yaml") as f: + config = yaml.safe_load(f) +``` + +**4. Separation of Concerns** + +Keep generation, evaluation, and visualization in separate scripts: + +| Script | Purpose | +|--------|---------| +| `run_experiment.py` | Core method execution | +| `run_baselines.py` | Baseline comparisons at same compute | +| `run_eval.py` | Blind evaluation / judge panels | +| `analyze_results.py` | Statistical analysis | +| `make_charts.py` | Figure generation | + +This lets you re-run evaluation without re-running expensive generation, and regenerate figures without re-running analysis. + +--- + +## Evaluation Protocols + +### Blind Judge Panels (for Subjective Tasks) + +When evaluating subjective outputs (writing, analysis, recommendations), use a blind judge panel: + +```python +import random + +def run_blind_evaluation(outputs: dict, task_prompt: str, num_judges: int = 7): + """ + Run blind evaluation of multiple method outputs. 
+
+    Args:
+        outputs: {"method_name": "output_text", ...}
+        task_prompt: The original task description
+        num_judges: Number of independent judge evaluations
+    """
+    rankings = []
+
+    for judge_i in range(num_judges):
+        # Randomize labels and presentation order per judge
+        methods = list(outputs.keys())
+        random.shuffle(methods)
+        labels = {m: chr(65 + i) for i, m in enumerate(methods)}  # A, B, C...
+
+        # Present to judge with randomized labels
+        prompt = f"Task: {task_prompt}\n\n"
+        for method in methods:
+            prompt += f"--- Proposal {labels[method]} ---\n{outputs[method]}\n\n"
+        prompt += "Rank all proposals from best to worst. Format: RANKING: [best], [second], [worst]"
+
+        # Judges see randomized labels, so map the returned ranking back to method names
+        ranking = call_judge(prompt)  # e.g. ["B", "A", "C"], best first
+        label_to_method = {v: k for k, v in labels.items()}
+        rankings.append({"labels": labels,
+                         "ranking": [label_to_method[lab] for lab in ranking]})
+
+    # Aggregate via Borda count
+    return compute_borda(rankings, n_methods=len(outputs))
+
+def compute_borda(rankings, n_methods=3):
+    """Borda count: n_methods points for 1st place, down to 1 point for last."""
+    scores = {}
+    points = {pos: n_methods - pos for pos in range(n_methods)}
+
+    for r in rankings:
+        for position, method in enumerate(r["ranking"]):
+            scores[method] = scores.get(method, 0) + points.get(position, 0)
+
+    return scores
+```
+
+Key design decisions:
+- **Randomize both labels AND order** per judge to prevent position bias
+- **Use odd number of judges** (3, 5, 7) to break ties
+- **Conservative tiebreak**: Incumbent/baseline wins ties (prevents false positives)
+- **CoT judges** match non-CoT quality at ~40% cost (1 CoT judge ≈ 3 standard judges)
+
+### Code/Objective Evaluation
+
+For tasks with ground-truth evaluation (code, math, factual):
+
+```python
+import subprocess
+
+def evaluate_code(solution: str, test_cases: list, timeout: int = 30):
+    """Run code solution against test cases with sandboxed execution."""
+    results = {"public": [], "private": []}
+
+    for test in test_cases:
+        try:
+            proc = subprocess.run(
+                ["python3", "-c", solution],
+                input=test["input"],
+                capture_output=True,
+                timeout=timeout,
+                
text=True + ) + actual = proc.stdout.strip() + expected = test["expected"].strip() + passed = actual == expected + except subprocess.TimeoutExpired: + passed = False + + category = "public" if test.get("public") else "private" + results[category].append(passed) + + return { + "public_pass_rate": sum(results["public"]) / max(len(results["public"]), 1), + "private_pass_rate": sum(results["private"]) / max(len(results["private"]), 1), + } +``` + +### Compute-Matched Comparison + +Always compare methods at equal compute budget. If your method uses N API calls, baselines get N calls too: + +| Method | Call Budget | Allocation | +|--------|-----------|------------| +| Single pass | 6 calls | 6 independent generations | +| Critique & revise | 6 calls | 1 generate + 5 revise rounds | +| Autoreason | 6 calls | 1 generate + 1 analysis + 4 revisions | +| Best-of-N | 6 calls | 6 independent, pick best on public test | + +### Human Evaluation Design + +Many ML/NLP papers require human evaluation, especially for subjective tasks (text generation, summarization, dialogue, creative writing). Poorly designed human evals are a common rejection reason. + +#### When Human Evaluation Is Required + +| Task Type | Required? | Notes | +|-----------|-----------|-------| +| Text generation (open-ended) | Yes | LLM-as-judge alone is insufficient for acceptance at ACL/EMNLP | +| Summarization | Usually | At minimum for a subset of outputs | +| Dialogue systems | Yes | User studies or annotation | +| Code generation | No | Test suites are objective ground truth | +| Classification | No | Standard metrics suffice | +| Any task with subjective quality | Strongly recommended | Strengthens the paper significantly | + +#### Annotation Protocol Design + +``` +Human Evaluation Protocol: +1. Define the evaluation dimensions (fluency, relevance, factual accuracy, etc.) +2. Create annotation guidelines with examples of each score level +3. Run a pilot with 2-3 annotators on 20-30 examples +4. 
Compute pilot inter-annotator agreement — if low, revise guidelines +5. Run full evaluation +6. Report: annotator count, agreement metrics, compensation, time per item +``` + +**Evaluation dimensions** (pick relevant subset): + +| Dimension | Definition | Scale | +|-----------|-----------|-------| +| Fluency | Grammaticality and naturalness | 1-5 Likert | +| Relevance | Does it address the task? | 1-5 Likert | +| Factual accuracy | Are stated facts correct? | Binary or 1-5 | +| Coherence | Logical flow and consistency | 1-5 Likert | +| Informativeness | Does it provide useful information? | 1-5 Likert | +| Overall preference | Which output is better? | A/B/Tie (pairwise) | + +**Pairwise comparison** (preferred over absolute scoring — more reliable): +- Present two outputs side-by-side (randomize left/right position) +- Ask: "Which is better? A / B / Tie" +- More discriminative and less susceptible to annotator calibration drift + +#### Inter-Annotator Agreement + +Always report agreement metrics. Without them, reviewers assume your annotations are unreliable. 
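For the Fleiss' kappa case (3+ annotators, categorical labels), no library snippet is given in this guide, but the statistic is short enough to compute with no dependencies. A minimal sketch — the example rating table is illustrative, not real annotation data:

```python
def fleiss_kappa(rating_table):
    """Fleiss' kappa for a table of shape (n_items, n_categories),
    where each cell counts how many raters put that item in that category.
    Assumes every item is rated by the same number of raters."""
    n_items = len(rating_table)
    n_categories = len(rating_table[0])
    n_raters = sum(rating_table[0])

    # Per-item observed agreement: fraction of agreeing rater pairs
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_table
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the marginal category proportions
    totals = [sum(row[j] for row in rating_table) for j in range(n_categories)]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_bar - p_exp) / (1 - p_exp)

# 4 items, 3 raters, 3 categories (illustrative counts)
table = [
    [3, 0, 0],  # all three raters chose category 0
    [0, 3, 0],
    [1, 2, 0],
    [0, 1, 2],
]
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```

Perfect agreement on every item yields kappa = 1.0; values near or below 0 indicate agreement no better than chance.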
+
+```python
+# Krippendorff's alpha (preferred — handles missing data, any scale)
+# pip install krippendorff
+import krippendorff
+
+# Ratings: rows = annotators, columns = items, values = scores (None = missing)
+ratings = [
+    [3, 4, 1, 2, 5, None, 3],  # Annotator 1
+    [3, 5, 1, 3, 5, 2, 3],     # Annotator 2
+    [4, 4, 2, 2, 4, 2, None],  # Annotator 3
+]
+alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
+print(f"Krippendorff's alpha: {alpha:.3f}")
+# Interpretation: >0.80 good, 0.67-0.80 acceptable, <0.67 questionable
+```
+
+```python
+# Cohen's kappa (for exactly 2 annotators, categorical data)
+from sklearn.metrics import cohen_kappa_score
+
+annotator_1 = [1, 2, 3, 1, 2, 3, 2]
+annotator_2 = [1, 2, 2, 1, 3, 3, 2]
+kappa = cohen_kappa_score(annotator_1, annotator_2)
+print(f"Cohen's kappa: {kappa:.3f}")
+# Interpretation: >0.80 excellent, 0.60-0.80 substantial, 0.40-0.60 moderate
+```
+
+| Metric | When to Use | Annotators | Scale |
+|--------|------------|-----------|-------|
+| Krippendorff's alpha | Default choice | Any number | Any (ordinal, nominal, ratio) |
+| Cohen's kappa | 2 annotators, categorical | Exactly 2 | Nominal/ordinal |
+| Fleiss' kappa | 3+ annotators, categorical | 3+ | Nominal |
+| Pearson/Spearman | Continuous scores | 2 | Interval/ratio |
+
+#### Crowdsourcing Platforms
+
+| Platform | Best For | Cost | Quality |
+|----------|----------|------|---------|
+| **Prolific** | Academic research, higher quality | $8-15/hr | High — academic participant pool |
+| **MTurk** | Large-scale, fast turnaround | $2-10/hr | Variable — use qualifications |
+| **Surge AI** | NLP-specific annotations | Premium | High — trained annotators |
+| **Expert annotators** | Domain-specific (medical, legal) | Highest | Highest — but slow |
+
+**Ethics requirements**:
+- Report compensation rate (must be at least the local minimum wage)
+- Describe annotator demographics if relevant
+- Obtain IRB/ethics approval if required by your institution
+- 
ACL venues explicitly require compensation documentation + +#### What to Report in the Paper + +``` +Human Evaluation Section Checklist: +- [ ] Number of annotators +- [ ] Annotator qualifications / recruitment method +- [ ] Number of items evaluated +- [ ] Evaluation dimensions with definitions +- [ ] Scale used (Likert, pairwise, binary) +- [ ] Inter-annotator agreement (Krippendorff's alpha or Cohen's kappa) +- [ ] Compensation rate +- [ ] Time per annotation item +- [ ] Whether annotators saw model identities (should be blind) +- [ ] Randomization of presentation order +``` + +--- + +## Statistical Analysis + +### Required Tests + +| Test | When to Use | Python | +|------|------------|--------| +| McNemar's test | Comparing two methods on same problems | `scipy.stats.binomtest` for small n | +| Two-proportion z-test | Comparing success rates | Custom or `statsmodels` | +| Fisher's exact test | Small sample pairwise comparison | `scipy.stats.fisher_exact` | +| Bootstrapped CI | Confidence intervals for any metric | Custom bootstrap | +| Cohen's h | Effect size for proportions | Manual calculation | + +### Standard Analysis Script + +```python +import numpy as np +from scipy import stats +from pathlib import Path +import json + +def load_all_results(results_dir): + """Load all results into a structured format.""" + results = {} + for result_file in Path(results_dir).rglob("result.json"): + parts = result_file.relative_to(results_dir).parts + if len(parts) >= 3: + experiment, task, strategy = parts[0], parts[1], parts[2] + data = json.loads(result_file.read_text()) + results.setdefault(experiment, {}).setdefault(strategy, {})[task] = data + return results + +def pairwise_mcnemar(method_a_results, method_b_results): + """McNemar's test for paired binary outcomes.""" + a_win_b_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if a and not b) + b_win_a_lose = sum(1 for a, b in zip(method_a_results, method_b_results) if b and not a) + + n = a_win_b_lose 
+ b_win_a_lose + if n < 25: + # Use exact binomial for small samples + result = stats.binomtest(a_win_b_lose, n, 0.5) + p_value = result.pvalue + else: + # Chi-squared approximation + chi2 = (abs(a_win_b_lose - b_win_a_lose) - 1)**2 / (a_win_b_lose + b_win_a_lose) + p_value = 1 - stats.chi2.cdf(chi2, df=1) + + return { + "a_wins": a_win_b_lose, + "b_wins": b_win_a_lose, + "n_discordant": n, + "p_value": p_value, + "significant": p_value < 0.05 + } + +def bootstrap_ci(data, n_bootstrap=10000, ci=0.95): + """Bootstrap confidence interval for mean.""" + means = [] + for _ in range(n_bootstrap): + sample = np.random.choice(data, size=len(data), replace=True) + means.append(np.mean(sample)) + lower = np.percentile(means, (1 - ci) / 2 * 100) + upper = np.percentile(means, (1 + ci) / 2 * 100) + return {"mean": np.mean(data), "ci_lower": lower, "ci_upper": upper} + +def cohens_h(p1, p2): + """Cohen's h effect size for two proportions.""" + return 2 * np.arcsin(np.sqrt(p1)) - 2 * np.arcsin(np.sqrt(p2)) +``` + +### Reporting Standards + +Always include in the paper: +- **Sample sizes**: n=X problems/tasks +- **Number of runs**: K independent runs if applicable +- **Error bars**: Specify standard deviation or standard error +- **Confidence intervals**: 95% CI for key results +- **Significance tests**: p-values for key comparisons +- **Effect sizes**: Cohen's d or h for practical significance + +--- + +## Monitoring (Cron Pattern) + +### Cron Prompt Template + +For each experiment batch, create a monitoring prompt: + +``` +Check the status of the [EXPERIMENT_NAME] experiment: + +1. Process check: ps aux | grep [PROCESS_PATTERN] +2. Log check: tail -30 [LOG_FILE] +3. Results check: ls [RESULT_DIR]/eval/ (or appropriate result location) +4. If results are available: + - Read the result JSON files + - Report metrics in a table (Borda scores, accuracy, etc.) + - Compute key comparisons between methods +5. 
If all experiments in this batch are complete: + - git add -A && git commit -m "[COMMIT_MESSAGE]" && git push + - Report final summary +6. Key question: [SPECIFIC ANALYTICAL QUESTION] + +If nothing has changed since the last check, respond with [SILENT]. +``` + +### Monitoring Best Practices + +1. **Check processes first** — don't read results if the experiment is still running and results are incomplete +2. **Read the log tail** — look for errors, progress indicators, completion messages +3. **Count completed vs expected** — "45/150 problems done" is more useful than "some results exist" +4. **Report in structured tables** — always include key metrics in a table +5. **Answer the key question** — each experiment should have a specific analytical question to answer when done +6. **[SILENT] for no-news** — suppress notifications when nothing has changed +7. **Commit on completion** — every completed batch gets committed with a descriptive message + +### Example Monitoring Report + +``` +## Code Experiments (Haiku 3.5) - COMPLETE + +| Strategy | Pass Rate (150 problems) | vs Single | +|----------|------------------------|-----------| +| single_pass | 38.0% | — | +| critique_revise | 35.2% | -2.8pp | +| **autoreason** | **40.0%** | **+2.0pp** | +| best_of_6 | 31.0% | -7.0pp | + +Key finding: Autoreason shows +2pp improvement over single pass, while +best-of-6 collapses due to single-public-test selection issue. + +Committed: `git commit -m "Add Haiku code results (150 problems, 4 strategies)"` +Next: Run significance tests on these results. 
+``` + +--- + +## Failure Recovery + +### Common Failures and Recovery + +| Failure | Detection | Recovery | +|---------|-----------|----------| +| **API credit exhaustion** | 402 errors in logs, incomplete results | Top up credits, re-run (skips completed work automatically) | +| **Rate limiting** | 429 errors, slow progress | Add retry logic with exponential backoff | +| **Process crash** | PID gone, log stops mid-problem | Re-run script (resumes from last checkpoint) | +| **Wrong model ID** | Model not found errors | Fix ID (e.g., `claude-opus-4-6` not `claude-opus-4.6`) | +| **Parallel slowdown** | Each experiment taking 2x longer | Reduce parallel experiments to 2-3 max | +| **Security scan blocks** | Commands blocked by security | Use `execute_code` instead of piped `terminal` commands | +| **Delegation failures** | `delegate_task` returns errors | Fall back to doing work directly | +| **Timeout on hard problems** | Process stuck, no log progress | Kill, skip problem, note in results | +| **Dataset path mismatch** | File not found errors | Verify paths before launching | + +### Retry Naming Convention + +When re-running failed experiments, use a suffix to track rounds: + +``` +logs/experiment_haiku_0_50.log # Round 1 +logs/experiment_haiku_0_50_r2.log # Round 2 (after credit exhaustion) +logs/experiment_haiku_0_50_r3.log # Round 3 (after bug fix) +``` + +### Pre-Flight Checklist + +Before launching any experiment batch: + +``` +Pre-Flight: +- [ ] API credits sufficient for estimated calls +- [ ] Model IDs correct (test with 1 problem first) +- [ ] Output directory exists and is writable +- [ ] Resume logic works (re-run won't overwrite existing results) +- [ ] Log file path is unique (won't overwrite previous logs) +- [ ] Dataset/task files are accessible +- [ ] Config matches intended experiment +``` + +--- + +## Task/Benchmark Design + +### Open-Ended Tasks (Subjective Evaluation) + +Design tasks that have clear objectives but subjective quality: + 
+
+```markdown
+# Task: [Title]
+
+## Context
+[Specific scenario with concrete details: company size, constraints, timeline]
+
+## Deliverable
+[Exact format and structure required]
+
+## Requirements
+- [Specific, measurable requirements]
+- [Not vague — "be comprehensive" is bad, "include exactly 6 sections" is good]
+```
+
+### Constrained Tasks (for Testing Scope Effects)
+
+Constrained tasks test whether methods respect scope boundaries. Design with:
+
+- **Fixed facts**: "Use only these N data points, add nothing else"
+- **Fixed deliverable**: Specific format (pitch, postmortem, memo — not "improve this")
+- **Fixed structure**: "These sections in this order, do not add/remove"
+- **Fixed change items**: "Address exactly these N points, nothing else"
+
+**Do NOT use word count as a scope constraint.** Word limits cause false convergence — outputs get rejected for length, not quality. Constrain scope (what to include), not length.
+
+### Example: Good vs Bad Constraints
+
+| Bad Constraint | Why | Good Constraint |
+|---------------|-----|-----------------|
+| "Max 500 words" | Judges reject for length | "Exactly 4 sections, each with 3 numbered items" |
+| "Be concise" | Too vague | "Each prohibition must reference a specific base fact" |
+| "Improve this" | Unbounded scope | "Write an incident postmortem with exactly this structure" |
+| "Make it better" | No clear criterion | "Address exactly these 3 reviewer concerns" |
+
+---
+
+## Visualization Best Practices
+
+### Setup: SciencePlots + matplotlib
+
+Install SciencePlots for publication-ready defaults:
+
+```bash
+pip install SciencePlots matplotlib numpy
+```
+
+**Option A: SciencePlots styles** (recommended — handles most defaults automatically):
+
+```python
+import matplotlib.pyplot as plt
+import scienceplots  # registers the styles
+
+# Pick a style:
+# 'science' — clean, serif fonts, suitable for most venues
+# 'science+ieee' — IEEE-style (good for two-column papers)
+# 'science+nature' — 
Nature-style +# Add 'no-latex' if LaTeX is not installed on the machine generating plots + +with plt.style.context(['science', 'no-latex']): + fig, ax = plt.subplots(figsize=(3.5, 2.5)) # single-column width + # ... plot ... + fig.savefig('paper/fig_results.pdf', bbox_inches='tight') +``` + +**Option B: Manual rcParams** (when you need full control): + +```python +import matplotlib.pyplot as plt + +plt.rcParams.update({ + 'font.size': 10, + 'font.family': 'serif', + 'axes.labelsize': 11, + 'axes.titlesize': 11, + 'xtick.labelsize': 9, + 'ytick.labelsize': 9, + 'legend.fontsize': 9, + 'figure.figsize': (3.5, 2.5), # single-column default + 'figure.dpi': 300, + 'savefig.dpi': 300, + 'savefig.bbox': 'tight', + 'savefig.pad_inches': 0.05, + 'axes.linewidth': 0.8, + 'lines.linewidth': 1.5, + 'lines.markersize': 5, + 'axes.grid': True, + 'grid.alpha': 0.3, + 'grid.linewidth': 0.5, +}) +``` + +### Standard Figure Sizes (Two-Column Format) + +| Use Case | figsize | Notes | +|----------|---------|-------| +| Single column | `(3.5, 2.5)` | Fits in one column of two-column layout | +| Double column | `(7.0, 3.0)` | Spans full page width | +| Square (heatmap, confusion matrix) | `(3.5, 3.5)` | Single column | +| Tall single (many rows) | `(3.5, 5.0)` | Use sparingly | + +### Colorblind-Safe Palette (Okabe-Ito) + +Use this palette for all paper figures. 
It is distinguishable by people with all common forms of color vision deficiency: + +```python +COLORS = { + 'blue': '#0072B2', + 'orange': '#E69F00', + 'green': '#009E73', + 'red': '#D55E00', + 'purple': '#CC79A7', + 'cyan': '#56B4E9', + 'yellow': '#F0E442', + 'black': '#000000', +} + +# As a list for cycling: +COLOR_CYCLE = ['#0072B2', '#D55E00', '#009E73', '#E69F00', '#CC79A7', '#56B4E9'] +``` + +Also differentiate lines by **marker and linestyle**, not just color: +```python +STYLES = [ + {'color': '#0072B2', 'marker': 'o', 'linestyle': '-'}, + {'color': '#D55E00', 'marker': 's', 'linestyle': '--'}, + {'color': '#009E73', 'marker': '^', 'linestyle': '-.'}, + {'color': '#E69F00', 'marker': 'D', 'linestyle': ':'}, +] +``` + +### Complete Example: Method Comparison Bar Chart + +```python +import matplotlib.pyplot as plt +import numpy as np + +try: + import scienceplots + style = ['science', 'no-latex'] +except ImportError: + style = 'default' + +with plt.style.context(style): + methods = ['Single Pass', 'Critique+Revise', 'Best-of-N', 'Ours'] + scores = [73.2, 74.1, 68.5, 77.0] + errors = [2.1, 1.8, 3.2, 1.5] + colors = ['#56B4E9', '#E69F00', '#CC79A7', '#0072B2'] + + fig, ax = plt.subplots(figsize=(3.5, 2.5)) + bars = ax.bar(methods, scores, yerr=errors, capsize=3, + color=colors, edgecolor='black', linewidth=0.5) + + # Highlight "Ours" + bars[-1].set_edgecolor('#0072B2') + bars[-1].set_linewidth(1.5) + + ax.set_ylabel('Pass Rate (%)') + ax.set_ylim(60, 85) + ax.spines['top'].set_visible(False) + ax.spines['right'].set_visible(False) + + fig.savefig('paper/fig_comparison.pdf', bbox_inches='tight') +``` + +### Complete Example: Convergence/Trajectory Line Chart + +```python +with plt.style.context(style): + fig, ax = plt.subplots(figsize=(3.5, 2.5)) + + passes = np.arange(1, 16) + ours = [65, 72, 78, 82, 85, 87, 88, 89, 89.5, 90, 90, 90, 90, 90, 90] + baseline = [65, 68, 70, 71, 69, 67, 66, 65, 64, 63, 62, 61, 60, 59, 58] + + ax.plot(passes, ours, **STYLES[0], 
label='Ours', markersize=4) + ax.plot(passes, baseline, **STYLES[1], label='Critique+Revise', markersize=4) + + # Mark convergence point + ax.axvline(x=10, color='gray', linestyle=':', alpha=0.5, linewidth=0.8) + ax.annotate('Converged', xy=(10, 90), fontsize=8, ha='center', + xytext=(10, 93), arrowprops=dict(arrowstyle='->', color='gray')) + + ax.set_xlabel('Iteration') + ax.set_ylabel('Quality Score') + ax.legend(loc='lower right') + ax.spines['top'].set_visible(False) + ax.spines['right'].set_visible(False) + + fig.savefig('paper/fig_trajectory.pdf', bbox_inches='tight') +``` + +### Output Rules + +- **Always save as PDF**: `fig.savefig('fig.pdf')` — vector graphics, sharp at any zoom +- **Never save as PNG** for paper figures — raster PNGs look blurry when printed/zoomed +- **Exception**: Screenshots, photographs, or pixel-art visualizations → PNG at 600 DPI +- **Verify grayscale**: Print to grayscale PDF and check all information is still visible + +### Chart Types for Common Comparisons + +| Comparison Type | Chart | Notes | +|----------------|-------|-------| +| Method vs method | Grouped bar chart | Include error bars | +| Across model sizes | Line chart with CI bands | Log scale for model size axis | +| Ablation study | Stacked/grouped bar | Highlight removed component | +| Trajectory/convergence | Line chart over iterations | Show winner per iteration | +| Per-task breakdown | Heatmap or grouped bar | Show variance across tasks | diff --git a/skills/research/ml-paper-writing/references/reviewer-guidelines.md b/skills/research/research-paper-writing/references/reviewer-guidelines.md similarity index 75% rename from skills/research/ml-paper-writing/references/reviewer-guidelines.md rename to skills/research/research-paper-writing/references/reviewer-guidelines.md index 17e7cf0f7..415dc33f3 100644 --- a/skills/research/ml-paper-writing/references/reviewer-guidelines.md +++ b/skills/research/research-paper-writing/references/reviewer-guidelines.md @@ -105,7 
+105,7 @@ Reviewers are explicitly instructed to: - Penalizing authors for honest limitation acknowledgment - Rejecting for missing citations to reviewer's own work -### Timeline (NeurIPS 2025) +### Timeline (NeurIPS 2025 — verify dates for current year) - Bidding: May 17-21 - Reviewing period: May 29 - July 2 @@ -113,6 +113,8 @@ Reviewers are explicitly instructed to: - Discussion period: July 31 - August 13 - Final notifications: September 18 +> **Note**: These dates are from the 2025 cycle. Always check the current year's call for papers at the venue website. + --- ## ICML Reviewer Guidelines @@ -198,6 +200,70 @@ ACL has a dedicated ethics review process for: --- +## AAAI Reviewer Guidelines + +### Evaluation Criteria + +AAAI reviewers evaluate along similar axes to NeurIPS/ICML but with some differences: + +| Criterion | Weight | Notes | +|-----------|--------|-------| +| **Technical quality** | High | Soundness of approach, correctness of results | +| **Significance** | High | Importance of the problem and contribution | +| **Novelty** | Medium-High | New ideas, methods, or insights | +| **Clarity** | Medium | Clear writing, well-organized presentation | +| **Reproducibility** | Medium | Sufficient detail to reproduce results | + +### AAAI-Specific Considerations + +- **Broader AI scope**: AAAI covers all of AI, not just ML. Papers on planning, reasoning, knowledge representation, NLP, vision, robotics, and multi-agent systems are all in scope. Reviewers may not be deep ML specialists. +- **Formatting strictness**: AAAI reviewers are instructed to flag formatting violations. Non-compliant papers may be desk-rejected before review. +- **Application papers**: AAAI is more receptive to application-focused work than NeurIPS/ICML. Framing a strong application contribution is viable. +- **Senior Program Committee**: AAAI uses SPCs (Senior Program Committee members) who mediate between reviewers and make accept/reject recommendations. 
+ +### Scoring (AAAI Scale) + +- **Strong Accept**: Clearly above threshold, excellent contribution +- **Accept**: Above threshold, good contribution with minor issues +- **Weak Accept**: Borderline, merits outweigh concerns +- **Weak Reject**: Borderline, concerns outweigh merits +- **Reject**: Below threshold, significant issues +- **Strong Reject**: Well below threshold + +--- + +## COLM Reviewer Guidelines + +### Evaluation Criteria + +COLM reviews focus on relevance to language modeling in addition to standard criteria: + +| Criterion | Weight | Notes | +|-----------|--------|-------| +| **Relevance** | High | Must be relevant to language modeling community | +| **Technical quality** | High | Sound methodology, well-supported claims | +| **Novelty** | Medium-High | New insights about language models | +| **Clarity** | Medium | Clear presentation, reproducible | +| **Significance** | Medium-High | Impact on LM research and practice | + +### COLM-Specific Considerations + +- **Language model focus**: Reviewers will assess whether the contribution advances understanding of language models. General ML contributions need explicit LM framing. +- **Newer venue norms**: COLM is newer than NeurIPS/ICML, so reviewer calibration varies more. Write more defensively — anticipate a wider range of reviewer expertise. +- **ICLR-derived process**: Review process is modeled on ICLR (open reviews, author response period, discussion among reviewers). +- **Broad interpretation of "language modeling"**: Includes training, evaluation, alignment, safety, efficiency, applications, theory, multimodality (if language is central), and social impact of LMs. 
+ +### Scoring + +COLM uses an ICLR-style scoring system: +- **8-10**: Strong accept (top papers) +- **6-7**: Weak accept (solid contribution) +- **5**: Borderline +- **3-4**: Weak reject (below threshold) +- **1-2**: Strong reject + +--- + ## What Makes Reviews Strong ### Following Daniel Dennett's Rules diff --git a/skills/research/ml-paper-writing/references/sources.md b/skills/research/research-paper-writing/references/sources.md similarity index 100% rename from skills/research/ml-paper-writing/references/sources.md rename to skills/research/research-paper-writing/references/sources.md diff --git a/skills/research/ml-paper-writing/references/writing-guide.md b/skills/research/research-paper-writing/references/writing-guide.md similarity index 99% rename from skills/research/ml-paper-writing/references/writing-guide.md rename to skills/research/research-paper-writing/references/writing-guide.md index 3da7233b6..1177336b7 100644 --- a/skills/research/ml-paper-writing/references/writing-guide.md +++ b/skills/research/research-paper-writing/references/writing-guide.md @@ -225,8 +225,6 @@ Provide context before asking the reader to consider anything new. 
This applies --- ---- - ## Micro-Level Writing Tips ### From Ethan Perez (Anthropic) diff --git a/skills/research/ml-paper-writing/templates/README.md b/skills/research/research-paper-writing/templates/README.md similarity index 100% rename from skills/research/ml-paper-writing/templates/README.md rename to skills/research/research-paper-writing/templates/README.md diff --git a/skills/research/ml-paper-writing/templates/aaai2026/README.md b/skills/research/research-paper-writing/templates/aaai2026/README.md similarity index 100% rename from skills/research/ml-paper-writing/templates/aaai2026/README.md rename to skills/research/research-paper-writing/templates/aaai2026/README.md diff --git a/skills/research/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex b/skills/research/research-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex similarity index 100% rename from skills/research/ml-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex rename to skills/research/research-paper-writing/templates/aaai2026/aaai2026-unified-supp.tex diff --git a/skills/research/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex b/skills/research/research-paper-writing/templates/aaai2026/aaai2026-unified-template.tex similarity index 100% rename from skills/research/ml-paper-writing/templates/aaai2026/aaai2026-unified-template.tex rename to skills/research/research-paper-writing/templates/aaai2026/aaai2026-unified-template.tex diff --git a/skills/research/ml-paper-writing/templates/aaai2026/aaai2026.bib b/skills/research/research-paper-writing/templates/aaai2026/aaai2026.bib similarity index 100% rename from skills/research/ml-paper-writing/templates/aaai2026/aaai2026.bib rename to skills/research/research-paper-writing/templates/aaai2026/aaai2026.bib diff --git a/skills/research/ml-paper-writing/templates/aaai2026/aaai2026.bst b/skills/research/research-paper-writing/templates/aaai2026/aaai2026.bst similarity index 100% rename from 
skills/research/ml-paper-writing/templates/aaai2026/aaai2026.bst rename to skills/research/research-paper-writing/templates/aaai2026/aaai2026.bst diff --git a/skills/research/ml-paper-writing/templates/aaai2026/aaai2026.sty b/skills/research/research-paper-writing/templates/aaai2026/aaai2026.sty similarity index 100% rename from skills/research/ml-paper-writing/templates/aaai2026/aaai2026.sty rename to skills/research/research-paper-writing/templates/aaai2026/aaai2026.sty diff --git a/skills/research/ml-paper-writing/templates/acl/README.md b/skills/research/research-paper-writing/templates/acl/README.md similarity index 100% rename from skills/research/ml-paper-writing/templates/acl/README.md rename to skills/research/research-paper-writing/templates/acl/README.md diff --git a/skills/research/ml-paper-writing/templates/acl/acl.sty b/skills/research/research-paper-writing/templates/acl/acl.sty similarity index 100% rename from skills/research/ml-paper-writing/templates/acl/acl.sty rename to skills/research/research-paper-writing/templates/acl/acl.sty diff --git a/skills/research/ml-paper-writing/templates/acl/acl_latex.tex b/skills/research/research-paper-writing/templates/acl/acl_latex.tex similarity index 100% rename from skills/research/ml-paper-writing/templates/acl/acl_latex.tex rename to skills/research/research-paper-writing/templates/acl/acl_latex.tex diff --git a/skills/research/ml-paper-writing/templates/acl/acl_lualatex.tex b/skills/research/research-paper-writing/templates/acl/acl_lualatex.tex similarity index 100% rename from skills/research/ml-paper-writing/templates/acl/acl_lualatex.tex rename to skills/research/research-paper-writing/templates/acl/acl_lualatex.tex diff --git a/skills/research/ml-paper-writing/templates/acl/acl_natbib.bst b/skills/research/research-paper-writing/templates/acl/acl_natbib.bst similarity index 100% rename from skills/research/ml-paper-writing/templates/acl/acl_natbib.bst rename to 
skills/research/research-paper-writing/templates/acl/acl_natbib.bst
diff --git a/skills/research/ml-paper-writing/templates/acl/anthology.bib.txt b/skills/research/research-paper-writing/templates/acl/anthology.bib.txt
similarity index 100%
rename from skills/research/ml-paper-writing/templates/acl/anthology.bib.txt
rename to skills/research/research-paper-writing/templates/acl/anthology.bib.txt
diff --git a/skills/research/ml-paper-writing/templates/acl/custom.bib b/skills/research/research-paper-writing/templates/acl/custom.bib
similarity index 100%
rename from skills/research/ml-paper-writing/templates/acl/custom.bib
rename to skills/research/research-paper-writing/templates/acl/custom.bib
diff --git a/skills/research/ml-paper-writing/templates/acl/formatting.md b/skills/research/research-paper-writing/templates/acl/formatting.md
similarity index 100%
rename from skills/research/ml-paper-writing/templates/acl/formatting.md
rename to skills/research/research-paper-writing/templates/acl/formatting.md
diff --git a/skills/research/ml-paper-writing/templates/colm2025/README.md b/skills/research/research-paper-writing/templates/colm2025/README.md
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/README.md
rename to skills/research/research-paper-writing/templates/colm2025/README.md
diff --git a/skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.bib b/skills/research/research-paper-writing/templates/colm2025/colm2025_conference.bib
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.bib
rename to skills/research/research-paper-writing/templates/colm2025/colm2025_conference.bib
diff --git a/skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.bst b/skills/research/research-paper-writing/templates/colm2025/colm2025_conference.bst
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.bst
rename to skills/research/research-paper-writing/templates/colm2025/colm2025_conference.bst
diff --git a/skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.pdf b/skills/research/research-paper-writing/templates/colm2025/colm2025_conference.pdf
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.pdf
rename to skills/research/research-paper-writing/templates/colm2025/colm2025_conference.pdf
diff --git a/skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.sty b/skills/research/research-paper-writing/templates/colm2025/colm2025_conference.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.sty
rename to skills/research/research-paper-writing/templates/colm2025/colm2025_conference.sty
diff --git a/skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.tex b/skills/research/research-paper-writing/templates/colm2025/colm2025_conference.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/colm2025_conference.tex
rename to skills/research/research-paper-writing/templates/colm2025/colm2025_conference.tex
diff --git a/skills/research/ml-paper-writing/templates/colm2025/fancyhdr.sty b/skills/research/research-paper-writing/templates/colm2025/fancyhdr.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/fancyhdr.sty
rename to skills/research/research-paper-writing/templates/colm2025/fancyhdr.sty
diff --git a/skills/research/ml-paper-writing/templates/colm2025/math_commands.tex b/skills/research/research-paper-writing/templates/colm2025/math_commands.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/math_commands.tex
rename to skills/research/research-paper-writing/templates/colm2025/math_commands.tex
diff --git a/skills/research/ml-paper-writing/templates/colm2025/natbib.sty b/skills/research/research-paper-writing/templates/colm2025/natbib.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/colm2025/natbib.sty
rename to skills/research/research-paper-writing/templates/colm2025/natbib.sty
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/fancyhdr.sty b/skills/research/research-paper-writing/templates/iclr2026/fancyhdr.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/fancyhdr.sty
rename to skills/research/research-paper-writing/templates/iclr2026/fancyhdr.sty
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib b/skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.bib
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.bib
rename to skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.bib
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst b/skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.bst
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.bst
rename to skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.bst
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf b/skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.pdf
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.pdf
rename to skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.pdf
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty b/skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.sty
rename to skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.sty
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex b/skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/iclr2026_conference.tex
rename to skills/research/research-paper-writing/templates/iclr2026/iclr2026_conference.tex
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/math_commands.tex b/skills/research/research-paper-writing/templates/iclr2026/math_commands.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/math_commands.tex
rename to skills/research/research-paper-writing/templates/iclr2026/math_commands.tex
diff --git a/skills/research/ml-paper-writing/templates/iclr2026/natbib.sty b/skills/research/research-paper-writing/templates/iclr2026/natbib.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/iclr2026/natbib.sty
rename to skills/research/research-paper-writing/templates/iclr2026/natbib.sty
diff --git a/skills/research/ml-paper-writing/templates/icml2026/algorithm.sty b/skills/research/research-paper-writing/templates/icml2026/algorithm.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/algorithm.sty
rename to skills/research/research-paper-writing/templates/icml2026/algorithm.sty
diff --git a/skills/research/ml-paper-writing/templates/icml2026/algorithmic.sty b/skills/research/research-paper-writing/templates/icml2026/algorithmic.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/algorithmic.sty
rename to skills/research/research-paper-writing/templates/icml2026/algorithmic.sty
diff --git a/skills/research/ml-paper-writing/templates/icml2026/example_paper.bib b/skills/research/research-paper-writing/templates/icml2026/example_paper.bib
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/example_paper.bib
rename to skills/research/research-paper-writing/templates/icml2026/example_paper.bib
diff --git a/skills/research/ml-paper-writing/templates/icml2026/example_paper.pdf b/skills/research/research-paper-writing/templates/icml2026/example_paper.pdf
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/example_paper.pdf
rename to skills/research/research-paper-writing/templates/icml2026/example_paper.pdf
diff --git a/skills/research/ml-paper-writing/templates/icml2026/example_paper.tex b/skills/research/research-paper-writing/templates/icml2026/example_paper.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/example_paper.tex
rename to skills/research/research-paper-writing/templates/icml2026/example_paper.tex
diff --git a/skills/research/ml-paper-writing/templates/icml2026/fancyhdr.sty b/skills/research/research-paper-writing/templates/icml2026/fancyhdr.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/fancyhdr.sty
rename to skills/research/research-paper-writing/templates/icml2026/fancyhdr.sty
diff --git a/skills/research/ml-paper-writing/templates/icml2026/icml2026.bst b/skills/research/research-paper-writing/templates/icml2026/icml2026.bst
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/icml2026.bst
rename to skills/research/research-paper-writing/templates/icml2026/icml2026.bst
diff --git a/skills/research/ml-paper-writing/templates/icml2026/icml2026.sty b/skills/research/research-paper-writing/templates/icml2026/icml2026.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/icml2026.sty
rename to skills/research/research-paper-writing/templates/icml2026/icml2026.sty
diff --git a/skills/research/ml-paper-writing/templates/icml2026/icml_numpapers.pdf b/skills/research/research-paper-writing/templates/icml2026/icml_numpapers.pdf
similarity index 100%
rename from skills/research/ml-paper-writing/templates/icml2026/icml_numpapers.pdf
rename to skills/research/research-paper-writing/templates/icml2026/icml_numpapers.pdf
diff --git a/skills/research/ml-paper-writing/templates/neurips2025/Makefile b/skills/research/research-paper-writing/templates/neurips2025/Makefile
similarity index 100%
rename from skills/research/ml-paper-writing/templates/neurips2025/Makefile
rename to skills/research/research-paper-writing/templates/neurips2025/Makefile
diff --git a/skills/research/ml-paper-writing/templates/neurips2025/extra_pkgs.tex b/skills/research/research-paper-writing/templates/neurips2025/extra_pkgs.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/neurips2025/extra_pkgs.tex
rename to skills/research/research-paper-writing/templates/neurips2025/extra_pkgs.tex
diff --git a/skills/research/ml-paper-writing/templates/neurips2025/main.tex b/skills/research/research-paper-writing/templates/neurips2025/main.tex
similarity index 100%
rename from skills/research/ml-paper-writing/templates/neurips2025/main.tex
rename to skills/research/research-paper-writing/templates/neurips2025/main.tex
diff --git a/skills/research/ml-paper-writing/templates/neurips2025/neurips.sty b/skills/research/research-paper-writing/templates/neurips2025/neurips.sty
similarity index 100%
rename from skills/research/ml-paper-writing/templates/neurips2025/neurips.sty
rename to skills/research/research-paper-writing/templates/neurips2025/neurips.sty